hadoop - How do I implement UNION ALL in Pig?

Reposted by: 可可西里 · Updated: 2023-11-01 16:11:28

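I have 3 datasets, each holding 415 GB of data and each belonging to a different domain.

I need to union all of them with Pig, but the only thing I have found is its UNION clause, which launches a reducer at the end of the job to remove duplicate values.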

a = UNION a1, a2;
data = UNION a, a3;

Is there a way to skip the reducer phase, given that the data is already distinct?

Best answer

From the documentation on UNION:

Use the UNION operator to merge the contents of two or more relations. The UNION operator:

Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.

Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.

Does not eliminate duplicate tuples.

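Emphasis mine (on the last point). This suggests to me that UNION does not need a reducer step at all, because it never has to remove duplicate rows. Are you sure the reduce phase is caused by the UNION? It may be the result of another operator in your script.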
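One way to check where a reduce phase comes from is to ask Pig for the execution plan instead of running the job. The following is a minimal sketch, not the asker's actual script: the input paths and the PigStorage loader are hypothetical placeholders, and only the relation names a1, a2, a3, a, and data come from the question.

```pig
-- Hypothetical input locations and loader; substitute your own.
a1 = LOAD '/data/domain1' USING PigStorage('\t');
a2 = LOAD '/data/domain2' USING PigStorage('\t');
a3 = LOAD '/data/domain3' USING PigStorage('\t');

-- The two-step union from the question.
a = UNION a1, a2;
data = UNION a, a3;

-- Print the plans Pig would use to compute `data`, without executing the job.
EXPLAIN data;
```

When running on the MapReduce engine, the output includes a MapReduce plan in which each job lists its map and reduce stages. If the reduce stage for this script is empty, the UNION itself is map-only, and a reducer seen in the full job most likely comes from another operator such as DISTINCT, GROUP, ORDER BY, or a non-replicated JOIN.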

Bonus: you can simplify your example to: