
hadoop - Computing the statistical mode in Pig

Reposted. Author: 可可西里. Updated: 2023-11-01 14:46:04

How can I compute the statistical mode of a dataset in Apache Pig without using a UDF?

A,20
A,10
A,10
B,40
B,40
B,20
B,10

data = LOAD 'myData.txt' USING PigStorage(',') AS (key, value);
byKey = GROUP data BY key;
mode = FOREACH byKey GENERATE MODE(data.value); -- How to define MODE() ??
DUMP mode;
-- Correct answer: (A, 10), (B, 40)

Best answer

Here is a version that finds only one result per key:

data = LOAD 'mode_data.dat' USING PigStorage(',') AS (key, value);
byKeyValue = GROUP data BY (key, value);
cntKeyValue = FOREACH byKeyValue GENERATE FLATTEN(group) AS (key, value), COUNT(data) as cnt;
byKey = GROUP cntKeyValue BY key;
mode = FOREACH byKey {
    freq = ORDER cntKeyValue BY cnt DESC;
    topFreq = LIMIT freq 1; -- one of the most frequent values for key of the group
    GENERATE FLATTEN(topFreq.(key, value));
};
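
As a cross-check outside Pig, the same logic can be modelled in a few lines of Python. This is only a sketch for verifying expected output on small samples, not part of the original answer; the function name `one_mode_per_key` and the in-memory `rows` list are assumptions made for illustration. Like `LIMIT freq 1`, it returns just one value per key even when several values tie for the highest count.

```python
from collections import Counter

def one_mode_per_key(rows):
    """Return one most frequent value per key (hypothetical helper)."""
    counts = Counter(rows)  # (key, value) -> cnt, like cntKeyValue
    best = {}
    for (key, value), cnt in counts.items():
        # Keep the value with the largest count seen so far for this key.
        if key not in best or cnt > best[key][1]:
            best[key] = (value, cnt)
    return {key: value for key, (value, _) in best.items()}

# Sample data from the question.
rows = [("A", 20), ("A", 10), ("A", 10),
        ("B", 40), ("B", 40), ("B", 20), ("B", 10)]
print(one_mode_per_key(rows))  # {'A': 10, 'B': 40}
```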

This version finds all equally frequent values for the same key:

data = LOAD 'mode_data.dat' USING PigStorage(',') AS (key, value);
byKeyValue = GROUP data BY (key, value);
cntKeyValue = FOREACH byKeyValue GENERATE FLATTEN(group) AS (key, value), COUNT(data) as cnt;
byKey = GROUP cntKeyValue BY key;
mostFreqCnt = FOREACH byKey { -- calculate the biggest count for each key
    freq = ORDER cntKeyValue BY cnt DESC;
    topFreq = LIMIT freq 1;
    GENERATE FLATTEN(topFreq.(key, cnt)) as (key, cnt);
};

-- get all values with the same count and same key;
-- used cogroup as next command was throwing some errors during execution
modeAll = COGROUP cntKeyValue BY (key, cnt), mostFreqCnt BY (key, cnt);
mode = FOREACH (FILTER modeAll BY not IsEmpty(mostFreqCnt)) GENERATE FLATTEN(cntKeyValue.(key, value)) as (key, value);
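
To sanity-check this version on small inputs, the tie-preserving logic can be sketched in Python as well. This is an illustrative model rather than the answer's method; the name `all_modes_per_key` and the sample `rows` with a deliberate tie are assumptions. It mirrors the two steps above: find the largest count per key, then keep every value whose count matches it.

```python
from collections import Counter, defaultdict

def all_modes_per_key(rows):
    """Return every most frequent value per key (hypothetical helper)."""
    counts = Counter(rows)  # (key, value) -> cnt, like cntKeyValue

    # Largest count per key, like mostFreqCnt.
    max_cnt = defaultdict(int)
    for (key, _), cnt in counts.items():
        max_cnt[key] = max(max_cnt[key], cnt)

    # Keep all values whose count equals the per-key maximum.
    modes = defaultdict(list)
    for (key, value), cnt in counts.items():
        if cnt == max_cnt[key]:
            modes[key].append(value)
    return {key: sorted(values) for key, values in modes.items()}

# Made-up sample in which key A has two equally frequent values.
rows = [("A", 10), ("A", 10), ("A", 20), ("A", 20), ("B", 40)]
print(all_modes_per_key(rows))  # {'A': [10, 20], 'B': [40]}
```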

Regarding hadoop - computing the statistical mode in Pig, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14057841/
