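I have 3 datasets, each holding 415 GB of data and each belonging to a different domain.
I need to combine all of them using Pig, but the only tool I have is its UNION clause, which launches a reducer at the end of the job to remove duplicate values.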
a = union a1, a2;
data = union a, a3;
Is there a way to skip the reducer phase, since the data is already distinct?
Best answer
From the documentation on UNION:
Use the UNION operator to merge the contents of two or more relations. The UNION operator:
- Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
- Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
- Does not eliminate duplicate tuples.
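The last point is the important one. It suggests to me that UNION does not need a reducer step, because it never has to remove duplicate rows. Are you sure the reducer job is a result of the UNION? It may be the result of another operator.
Bonus: you can simplify your example to: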
B = UNION a1, a2, a3 ;
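One way to check which operator actually triggers the reduce phase is Pig's EXPLAIN command, which prints the logical, physical, and MapReduce plans for an alias. The sketch below is only an illustration: the input and output paths and the PigStorage delimiter are assumptions, not taken from the question.

```pig
-- Hypothetical input paths and delimiter; substitute your own.
a1 = LOAD 'input/a1' USING PigStorage('\t');
a2 = LOAD 'input/a2' USING PigStorage('\t');
a3 = LOAD 'input/a3' USING PigStorage('\t');

-- UNION keeps duplicate tuples; it does not deduplicate.
data = UNION a1, a2, a3;

-- Print the execution plans for 'data' without running the job.
EXPLAIN data;

-- Hypothetical output path.
STORE data INTO 'output/all';
```

If the MapReduce plan shows a reduce stage, the operator listed there (for example DISTINCT, GROUP, ORDER BY, or a JOIN elsewhere in the script) is the likely cause rather than the UNION itself.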
This question, "hadoop - How to implement Union All in Pig?", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/29986092/