gpt4 book ai didi

hadoop - Pig有洗牌功能吗?

转载 作者:可可西里 更新时间:2023-11-01 15:05:19 25 4
gpt4 key购买 nike

我的问题是 pig 中是否有内置函数来洗牌元组/包?

raw_record = LOAD '$inputPath' -- USING com.test.parser.TestParser;
record_project = FOREACH raw_record GENERATE
field1,
field2,
field3,
field4;

sl_record = FILTER record_project BY (field1=='1' OR field1=='2');
split sl_record into rec1 if field1=='1',rec2 if field1=='2';
rec2Sample = SAMPLE rec2 $samplingRate;
finalRec1 = FOREACH rec1 GENERATE
-1,
1,
field1,
field2,
field3,
field4;

finalRec2 = FOREACH rec2 GENERATE
1,
1,
field1,
field2,
field3,
field4;

unionRec = UNION finalRec1, finalRec2;
STORE unionRec INTO '$outputPath' USING PigStorage(',');


在上面的例子中,问题出在联合上,我看到所有的 finalRec1 后面跟着所有的 finalRec2。我需要将其洗牌或混合。

我采取的解决方法是:

raw_record = LOAD '$inputPath' -- USING com.test.parser.TestParser;
record_project = FOREACH raw_record GENERATE
field1,
field2,
field3,
field4;

sl_record = FILTER record_project BY (field1=='1' OR field1=='2');
split sl_record into rec1 if field1=='1',rec2 if field1=='2';
rec2Sample = SAMPLE rec2 $samplingRate;
finalRec1 = FOREACH rec1 GENERATE
-1,
1,
field1,
field2,
field3,
field4,
(chararray)RANDOM() AS id;

finalRec2 = FOREACH rec2 GENERATE
1,
1,
field1,
field2,
field3,
field4,
(chararray)RANDOM() AS id;

unionRec = UNION finalRec1, finalRec2;
mixedRec = ORDER unionRec BY id ASC
STORE mixedRec INTO '$outputPath' USING PigStorage(',');

通过这种方式我可以混合它们,但现在我无法编写 pig 单元测试。有什么办法可以直接 shuffle unionRec 并编写 pig 单元测试吗?

测试:

@Test
public void myPigUnitTest {
String []inputs=new String[] {
"inputPath=/src/test/resource/testFile.txt",
"samplingRate=1",
"outputPath=dummy"
};
PigTest pigTest = PigUnitUtil.createPigTest("pathToMyPigFile",inputs);
String [] expectedUnion;
String [] expectedMixedRec;
pigTest.assertOutput("unionRec",expectedUnion);
pigTest.assertOutput("mixedRec",expectedMixedRec);
}

这里的问题是 unionRec 和 mixedRec 有随机数,而且 mixed 的顺序被搞乱了。

最佳答案

我想出了一个解决办法:

raw_record = LOAD '$inputPath' -- USING com.test.parser.TestParser;
record_project = FOREACH raw_record GENERATE
field1,
field2,
field3,
field4;

sl_record = FILTER record_project BY (field1=='1' OR field1=='2');
split sl_record into rec1 if field1=='1',rec2 if field1=='2';
rec2Sample = SAMPLE rec2 $samplingRate;
finalRec1 = FOREACH rec1 GENERATE
-1 as label1,
1 as label2,
field1 as label3,
field2 as label4,
field3 as label5,
field4 as label6;

finalRec2 = FOREACH rec2 GENERATE
-1 as label1,
1 as label2,
field1 as label3,
field2 as label4,
field3 as label5,
field4 as label6;

unionRec = UNION finalRec1, finalRec2;
unionRecWithId = FOREACH unionRec GENERATE label1, label2, label3, label4, label5, label6,(chararray)RANDOM() AS id;
mixedRec = ORDER unionRecWithId by id ASC;
STORE mixedRec INTO '$outputPath' USING PigStorage(',');

现在我验证 unionRec 是否具有预期的所有数据。

关于hadoop - Pig有洗牌功能吗?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31278853/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com