
hadoop - Counting distinct values in a bag in Pig

Reposted. Author: 可可西里. Updated: 2023-11-01 14:26:17

I have a question about Pig when doing what looks like a two-level grouping. For example, suppose I have sample input data like this:

email_id:chararray    from:chararray        to:bag{recipients:tuple(recipient:chararray)}
e1 user1@example.com {(friend1@example.com),(friend2@example.com),(friend3@myusers.com)}
e2 user1@example.com {(friend1@example.com),(friend4@example.com)}
e3 user1@example.com {(friend5@example.com)}
e4 user2@example.com {(friend2@example.com),(friend4@example.com)}

So each row is one email from the "from" user to the users in the "to" bag.

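What I ultimately want is a list of all senders together with everyone they sent email to, including the number of emails sent to each recipient, sorted from highest to lowest count, for example: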

user1@example.com {(friend1@example.com, 2), (friend2@example.com, 1), (friend3@myusers.com, 1), (friend4@example.com, 1), (friend5@example.com, 1)}
user2@example.com {(friend2@example.com, 1), (friend4@example.com, 1)}

Any suggestions on the best way to solve this in Pig would be much appreciated!

Best answer

Here is one version of the script:

inpt = load '/pig_data/pig_fun/input/from_senders.txt' as (email_id:chararray, from:chararray, to:bag{recipients:tuple(recipient:chararray)});

pivot = foreach inpt generate from, FLATTEN(to);
pivot = foreach pivot generate from, to::recipient as recipient;
dump pivot;
/*
(user1@example.com,friend1@example.com)
(user1@example.com,friend2@example.com)
(user1@example.com,friend3@myusers.com)
(user1@example.com,friend1@example.com)
(user1@example.com,friend4@example.com)
(user1@example.com,friend5@example.com)
(user2@example.com,friend2@example.com)
(user2@example.com,friend4@example.com)
*/

grp = group pivot by (from, recipient);
with_count = foreach grp generate FLATTEN(group), COUNT(pivot) as count;
dump with_count;
/*
(user1@example.com,friend1@example.com,2)
(user1@example.com,friend2@example.com,1)
(user1@example.com,friend3@myusers.com,1)
(user1@example.com,friend4@example.com,1)
(user1@example.com,friend5@example.com,1)
(user2@example.com,friend2@example.com,1)
(user2@example.com,friend4@example.com,1)
*/

to_bag = group with_count by from;
result = foreach to_bag {
    order_by_count = order with_count by count desc;
    generate group as from, order_by_count.(recipient, count);
};
dump result;
/*
(user1@example.com,{(friend1@example.com,2),(friend2@example.com,1),(friend3@myusers.com,1),(friend4@example.com,1),(friend5@example.com,1)})
(user2@example.com,{(friend2@example.com,1),(friend4@example.com,1)})
*/
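
To sanity-check the expected output without a Hadoop cluster, the three steps of the script (flatten, group and count, regroup and order) can be mirrored in plain Python. This is only a rough sketch and not Pig code: the function name `count_recipients` and the in-memory `emails` list are made up for illustration, and ties are broken alphabetically by recipient, which is an assumption (the Pig script orders by count only and does not guarantee the order of ties).

```python
from collections import Counter, defaultdict

# Sample rows from the question: (email_id, sender, recipients).
emails = [
    ("e1", "user1@example.com",
     ["friend1@example.com", "friend2@example.com", "friend3@myusers.com"]),
    ("e2", "user1@example.com", ["friend1@example.com", "friend4@example.com"]),
    ("e3", "user1@example.com", ["friend5@example.com"]),
    ("e4", "user2@example.com", ["friend2@example.com", "friend4@example.com"]),
]


def count_recipients(rows):
    """Hypothetical helper mirroring the Pig script's three steps."""
    # Step 1 (like FLATTEN(to)): one (sender, recipient) pair per recipient.
    pairs = [(sender, rcpt) for _, sender, rcpts in rows for rcpt in rcpts]

    # Step 2 (like GROUP BY (from, recipient) + COUNT): count each pair.
    pair_counts = Counter(pairs)

    # Step 3 (like GROUP BY from + nested ORDER BY count DESC):
    # collect (recipient, count) per sender, then sort by count descending.
    by_sender = defaultdict(list)
    for (sender, rcpt), n in pair_counts.items():
        by_sender[sender].append((rcpt, n))

    # Assumption: ties are broken alphabetically by recipient.
    return {
        sender: sorted(items, key=lambda rc: (-rc[1], rc[0]))
        for sender, items in by_sender.items()
    }


if __name__ == "__main__":
    for sender, ranked in count_recipients(emails).items():
        print(sender, ranked)
```

Counting the pairs first and regrouping afterwards follows the same order of operations as the script: the count is attached to each (sender, recipient) pair before the pairs are collected back into one bag per sender.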

Hope this helps.

Regarding "hadoop - Counting distinct values in a bag in Pig", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11389962/
