
hadoop - Apache Pig: Filter one tuple on another?

Reposted. Author: 可可西里. Updated: 2023-11-01 16:34:16

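I want to run a Pig script that splits the data into two relations (tuples, or whatever Pig calls them) based on a condition on col2, manipulates col2 into another column, and then compares the two manipulated relations to perform an additional exclusion.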

REGISTER /home/user1/piggybank.jar;

log = LOAD '../user2/hadoop_file.txt' AS (col1, col2);

--log = LIMIT log 1000000;
isnt_filtered = FILTER log BY (NOT col2 == 'Some value');
isnt_generated = FOREACH isnt_filtered GENERATE col2, col1, RANDOM() * 1000000 AS random, com.some.valueManipulation(col1) AS isnt_manipulated;

is_filtered = FILTER log BY (col2 == 'Some value');
is_generated = FOREACH is_filtered GENERATE com.some.calculation(col1) AS is_manipulated;
is_distinct = DISTINCT is_generated;

Splitting and manipulating is the easy part. This is where it gets complicated...

merge_filtered = FOREACH is_generated {FILTER isnt_generated BY (NOT isnt_manipulated == is_generated.is_manipulated)};

If I can figure out this one line, the rest falls into place.

merge_ordered = ORDER merge_filtered BY random, col2, col1;
merge_limited = LIMIT merge_ordered 400000;

STORE merge_limited into 'file';

Here is an I/O example:

col1      col2      manipulated
This      qWerty    W
Is        qweRty    R
An        qwertY    Y
Example   qwErty    E
Of        qwerTy    T
Example   Qwerty    Q
Data      qWerty    W


isnt
E
Y


col1      col2
This      qWerty
Is        qweRty
Of        qwerTy
Example   Qwerty
Data      qWerty

Best answer

I'm still not sure what you need, but I believe you can reproduce your input and output with the following (untested):

data = LOAD 'input' AS (col1:chararray, col2:chararray);
exclude = LOAD 'exclude' AS (excl:chararray);

m = FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated;
test = COGROUP m BY manipulated, exclude BY excl;

-- Here you can choose IsEmpty or NOT IsEmpty according to whether you want to exclude or include
final = FOREACH (FILTER test BY IsEmpty(exclude)) GENERATE FLATTEN(m);

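With COGROUP, all tuples from each relation are grouped by the grouping key. If the bag of tuples from exclude is empty, the grouping key is not present in the exclusion list, so you keep the tuples from m that have that key. Conversely, if the grouping key does appear in exclude, that bag is not empty, and the tuples from m with that key are filtered out.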
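To make the COGROUP reasoning concrete, here is a minimal plain-Python model of it, run against the sample data from the question. It is only a sketch of the semantics, not Pig code: `manipulate` is a hypothetical stand-in for `YourUDF` (it assumes, from the sample, that the manipulated value is the first uppercase letter of col2), and `cogroup` is an illustrative helper, not a Pig API.

```python
from collections import defaultdict


def manipulate(col2):
    # Hypothetical stand-in for YourUDF: assumes the manipulated value is
    # the first uppercase letter of col2, as the sample data suggests.
    return next(ch for ch in col2 if ch.isupper())


def cogroup(m, exclude):
    # Models COGROUP m BY manipulated, exclude BY excl:
    # each key maps to (bag of m tuples, bag of exclude values).
    groups = defaultdict(lambda: ([], []))
    for row in m:
        groups[row[2]][0].append(row)
    for excl in exclude:
        groups[excl][1].append(excl)
    return groups


data = [
    ("This", "qWerty"), ("Is", "qweRty"), ("An", "qwertY"),
    ("Example", "qwErty"), ("Of", "qwerTy"), ("Example", "Qwerty"),
    ("Data", "qWerty"),
]
exclude = ["E", "Y"]

# FOREACH data GENERATE col1, col2, YourUDF(col2) AS manipulated
m = [(col1, col2, manipulate(col2)) for col1, col2 in data]

# FILTER ... BY IsEmpty(exclude), then FLATTEN(m)
final = [
    row
    for m_bag, excl_bag in cogroup(m, exclude).values()
    if not excl_bag
    for row in m_bag
]

for col1, col2, _ in final:
    print(col1, col2)
```

The rows whose key is E or Y ("Example qwErty" and "An qwertY") are dropped, and the remaining five rows match the expected output in the question. The row order here follows dictionary insertion order; Pig itself gives no ordering guarantee without an explicit ORDER.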

Regarding "hadoop - Apache Pig: Filter one tuple on another?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/13424947/
