
hadoop - How to implement Union All in Pig?

Reposted. Author: 可可西里. Updated: 2023-11-01 16:11:28

I have 3 datasets, each holding 415 GB of data and each belonging to a different domain.

I need to union all of them using Pig, but the only option I have is its UNION clause, which launches a reducer at the end of the job to remove duplicate values.

a = UNION a1, a2;
data = UNION a, a3;

Is there a way to skip the reducer part, since the data is already distinct?

Best answer

From the documentation on UNION:

Use the UNION operator to merge the contents of two or more relations. The UNION operator:

  • Does not preserve the order of tuples. Both the input and output relations are interpreted as unordered bags of tuples.
  • Does not ensure (as databases do) that all tuples adhere to the same schema or that they have the same number of fields. In a typical scenario, however, this should be the case; therefore, it is the user's responsibility to either (1) ensure that the tuples in the input relations have the same schema or (2) be able to process varying tuples in the output relation.
  • Does not eliminate duplicate tuples.

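Emphasis mine, on the last point. This suggests to me that no reducer step is needed to complete a UNION, because it does not need to remove duplicate rows. Are you sure the reducer job is a result of the UNION? It may be the result of another operator.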
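One way to check where the reduce phase comes from is to ask Pig for the execution plan of the final relation. The following is a minimal sketch, assuming the MapReduce execution engine, tab-separated inputs, and a two-field schema; the input paths and field names are hypothetical, since the question does not give them.

```pig
-- Hypothetical input paths and schema; substitute your own.
a1 = LOAD '/data/domain1' USING PigStorage('\t') AS (id:chararray, value:chararray);
a2 = LOAD '/data/domain2' USING PigStorage('\t') AS (id:chararray, value:chararray);
a3 = LOAD '/data/domain3' USING PigStorage('\t') AS (id:chararray, value:chararray);

-- UNION keeps duplicate tuples, as the quoted documentation states.
data = UNION a1, a2, a3;

-- Print the logical, physical, and MapReduce plans for `data`
-- without running the job.
EXPLAIN data;
```

If the MapReduce plan shows a reduce phase, it is worth looking for operators such as DISTINCT, GROUP, JOIN, or ORDER elsewhere in the script, since those typically require one.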

Bonus: you can simplify your example to:

B = UNION a1, a2, a3;
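For contrast, here is a sketch of how the "union all" behaviour differs from an explicitly de-duplicated union. The input and output paths are hypothetical, and the relations are loaded without a schema to keep the example short. Per the documentation quoted above, UNION by itself keeps duplicates; de-duplication happens only if DISTINCT is applied, and that operator would be expected to add a reduce phase.

```pig
-- Hypothetical input paths; default PigStorage loader, no schema declared.
a1 = LOAD '/data/domain1';
a2 = LOAD '/data/domain2';
a3 = LOAD '/data/domain3';

-- Behaves like SQL UNION ALL: duplicate tuples are kept.
all_rows = UNION a1, a2, a3;
STORE all_rows INTO '/out/union_all';          -- hypothetical output path

-- Behaves like SQL UNION: DISTINCT removes duplicate tuples.
unique_rows = DISTINCT all_rows;
STORE unique_rows INTO '/out/union_distinct';  -- hypothetical output path
```

If the data is already known to be distinct, as in the question, the first STORE alone is sufficient and the DISTINCT step can be dropped.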

Regarding "hadoop - How to implement Union All in Pig?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29986092/
