
hadoop - Pig: filter out the last tuple in a relation


I have the following data in HDFS and I want to remove the last row.

/user/cloudera/test/testfile.csv

Day,TimeCST,Conditions
1,12:53 AM,Clear
1,1:53 AM,Clear
1,2:53 AM,Clear
1,3:53 AM,Clear
1,4:53 AM,Clear
1,5:53 AM,Clear
1,6:53 AM,Clear
1,7:53 AM,Clear
1,8:53 AM,Clear
1,9:53 AM,Clear
1,10:53 AM,Clear
1,11:53 AM,Clear
1,12:53 PM,Clear
1,1:53 PM,Clear
1,2:53 PM,Clear
1,3:53 PM,Clear
1,4:53 PM,Clear
1,5:53 PM,Clear

First I load the data, drop the header, and then get the row/tuple count:
rawdata = LOAD 'hdfs:/user/cloudera/test/testfile.csv' using PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; --filters out header
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata);
dump rowcount; --Prints (18)

Next I rank the data and then try to filter out the last row/tuple using the generated row number:
ranked = RANK filtereddata;
weatherdata = FILTER ranked BY $0 != rowcount.$0;
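
For reference, RANK (without DENSE) prepends a 1-based row number as a new first field of type long, so in ranked the rank is $0 and day, timecst, condition shift to $1..$3. A minimal sketch of what to expect (the rank field name is assumed from Pig's default rank_<alias> convention):

DESCRIBE ranked;
-- expected schema, roughly:
-- ranked: {rank_filtereddata: long, day: int, timecst: chararray, condition: chararray}
-- so $0 in the FILTER below is the generated row number (1..18 for this file)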

The filter above fails with the following error:
ERROR 2017: Internal error creating job configuration.
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias weatherdata.....

However, if I hard-code the row count into the script as follows, the job runs fine:
weatherdata = FILTER ranked BY $0 != 18;

I want to avoid hard-coding the row count. Do you see where I might be going wrong? Thanks.

Apache Pig version 0.12.0-cdh5.5.0 (rexported)
compiled Nov 09 2015, 12:41:48

Best answer

Try casting it:

weatherdata = FILTER ranked BY $0 != (int)rowcount.$0;
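
Putting it together, a minimal sketch of the full script with the cast applied (paths and aliases are taken from the question; the scalar projection rowcount.$0 assumes rowcount holds exactly one tuple, which it does here):

rawdata = LOAD 'hdfs:/user/cloudera/test/testfile.csv' USING PigStorage(',') AS (day:int, timecst:chararray, condition:chararray);
filtereddata = FILTER rawdata BY day > 0; -- the 'Day' header fails the int cast, becomes null, and is filtered out
rowcount = FOREACH (GROUP filtereddata ALL) GENERATE COUNT_STAR(filtereddata);
ranked = RANK filtereddata;
-- rowcount has a single tuple, so rowcount.$0 can be referenced as a scalar; cast it so the comparison types line up
weatherdata = FILTER ranked BY $0 != (int)rowcount.$0;
DUMP weatherdata; -- expected: every tuple except the last one (1,5:53 PM,Clear)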

For this question (hadoop - Pig: filter out the last tuple in a relation), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36753229/
