
hadoop - Get the value of a unique record using Pig


Below is the input dataset.

col1,col2,col3,col4,col5

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

Grouping on col2, col3 and col4 gives the unique records. I need to take any one col1 value from each such group and populate it as a new field, col6. The expected output is below.

col1,col2,col3,col4,col5,col6

key1,111,1,12/11/2016,10,key3
key2,111,1,12/11/2016,10,key3
key3,111,1,12/11/2016,10,key3
key4,222,2,12/22/2016,10,key5
key5,222,2,12/22/2016,10,key5
key6,333,3,12/30/2016,10,key6
key7,111,0,12/11/2016,10,key7

Below is my script, but I'm hitting an error.

A = load 'test1.csv' using PigStorage(',');
B = GROUP A by ($1,$2,$3);
C = FOREACH B GENERATE FLATTEN(group), MAX(A.$0);

ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2106: Error executing an algebraic function
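A likely cause of ERROR 2106 here is that the LOAD has no schema, so every field defaults to bytearray and MAX(A.$0) ends up trying to treat values such as key1 as numbers. A minimal sketch of the same GROUP + MAX idea with an explicit chararray schema (Pig's builtin MAX also handles chararray bags; test1.csv is the file name from the script above):

A = LOAD 'test1.csv' USING PigStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray);
B = GROUP A BY (col2, col3, col4);
-- FLATTEN(A) restores the original rows; the group-level MAX(A.col1) is repeated on each of them
C = FOREACH B GENERATE FLATTEN(A), MAX(A.col1) AS col6;
DUMP C;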

Best Answer

This looks like a good use case for a nested FOREACH.

Reference: https://pig.apache.org/docs/r0.14.0/basic.html#foreach

Input:

key1,111,1,12/11/2016,10
key2,111,1,12/11/2016,10
key3,111,1,12/11/2016,10
key4,222,2,12/22/2016,10
key5,222,2,12/22/2016,10
key6,333,3,12/30/2016,10
key7,111,0,12/11/2016,10

PigScript

A = load 'input.csv' using PigStorage(',')  AS (col1:chararray,col2:chararray,col3:chararray,col4:chararray,col5:chararray);
B = FOREACH (GROUP A BY (col2, col3, col4)) {
    ordered = ORDER A BY col1 DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(A) AS (col1:chararray, col2:chararray, col3:chararray, col4:chararray, col5:chararray),
             FLATTEN(latest.col1) AS col6:chararray;
};

DUMP B;

Output:

(key1,111,1,12/11/2016,10,key3)
(key2,111,1,12/11/2016,10,key3)
(key3,111,1,12/11/2016,10,key3)
(key4,222,2,12/22/2016,10,key5)
(key5,222,2,12/22/2016,10,key5)
(key6,333,3,12/30/2016,10,key6)
(key7,111,0,12/11/2016,10,key7)
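To write the result out in the same comma-separated format instead of dumping it, a STORE along these lines should work (the path 'pig_output' is only a placeholder):

STORE B INTO 'pig_output' USING PigStorage(',');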

Regarding hadoop - get the value of a unique record using Pig, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44635519/
