
hadoop - How to join 2 datasets with the same schema in Pig

Reposted. Author: 可可西里. Updated: 2023-11-01 16:47:26

Hi, I'm new to Pig programming and have run into a problem I'm having trouble solving:

I have 2 datasets

A: (accountId:chararray, title:chararray, genre:chararray)

("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")

B: (accountId:chararray, title:chararray, genre:chararray)

("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")

The result I want should be

(accountId:chararray, {(),(),...})

(A123, {("A123", "Harry Potter", "Action/Adventure"),
("A123", "Sherlock Holmes", "Mystery"),
("A123", "Divergent", "Action"),
("A123", "Downton Abbey", "Drama")
})

(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama"),
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})

Currently I am doing:

ANS = JOIN A BY accountId, B BY accountId;

But the result looks like

Schema: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})

(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama")}
"B456", {
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})

Any idea what I might be doing wrong?

Best answer

Try this:

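-- IMPORTANT: register datafu.jar (adjust the path and file name to your DataFu jar)
register datafu.jar;
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
-- cogroup yields one tuple per id: (group, {rows of A}, {rows of B})
C = cogroup A by id, B by id;
-- keep the id and concatenate the two bags into a single bag
D = foreach C generate group as id, BagConcat(A, B) as titles;
dump D;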

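JOIN simply joins the rows of the two relations. You want to accomplish two things:

  • Group all rows in each relation that belong to the same account
  • Join the two "grouped" relations (getting only the IDs that exist in both relations)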

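Both of these steps are performed by COGROUP. Note that COGROUP is an outer operation by default: an ID that appears in only one relation is still kept, with an empty bag for the other relation, unless you add the INNER keyword. The best explanation I have read is here: http://joshualande.com/cogroup-in-pig/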
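To see what COGROUP produces before any bags are merged, a minimal sketch like the following may help. It assumes the same comma-separated input files 'A' and 'B' used in the answer; the alias C_both is a hypothetical name introduced only for illustration.

```pig
-- Sketch only: assumes the input files 'A' and 'B' from the answer above.
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);

-- One output tuple per id, with three fields:
--   group : the id itself
--   A     : a bag holding every row of A with that id
--   B     : a bag holding every row of B with that id
C = cogroup A by id, B by id;
describe C;

-- Optional variant (hypothetical alias C_both): INNER drops any id
-- whose bag would be empty on that side, so only ids present in both
-- inputs remain.
C_both = cogroup A by id inner, B by id inner;
dump C_both;
```

With the sample data in the question every id appears in both datasets, so the default and the INNER variant should group the same ids; the difference only shows up when an account is missing from one side.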

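Your relation will now contain the group key (the ID) and two bags (one from A, one from B), each holding the rows from the original relation. To "union" them into a single bag, use the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs with a lot of useful things in it. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html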
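If adding DataFu is not an option, one possible alternative, which is not part of the original answer, relies on the fact that A and B share the same schema: stack the two relations with UNION and then GROUP by the ID. The sketch below again assumes the input files 'A' and 'B'; the aliases U, G and R are hypothetical names chosen for illustration.

```pig
-- Sketch only: an alternative to COGROUP + BagConcat using built-in operators.
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);

-- UNION stacks the rows of both relations; it neither removes duplicates
-- nor guarantees any row order.
U = union A, B;

-- GROUP collects all rows with the same id into one bag (named U).
G = group U by id;

-- Rename the fields so each tuple reads (id, {rows from A and B}).
R = foreach G generate group as id, U as titles;
dump R;
```

Unlike the INNER form of COGROUP, this keeps an ID even when it appears in only one of the two datasets.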

A similar question about "hadoop - How to join 2 datasets with the same schema in Pig" can be found on Stack Overflow: https://stackoverflow.com/questions/36004525/
