
hadoop - NOT IN functionality in Pig

Reposted. Author: 行者123. Updated: 2023-12-02 21:05:48

I am trying to use the DIFF() function in Pig to find the differences between two tables (a source table and a destination table), as follows:

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);


destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:chararray,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);

cogroupnew= COGROUP sourcenew by ID inner, destnew by ID inner;

diffnew = FOREACH cogroupnew GENERATE DIFF(sourcenew,destnew);

DUMP diffnew;

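This gives the differences between the two tables, or an empty bag {} when the tuples match. Up to this point it works fine. The next step is to find the extra records in the source file that are not in the destination, for which I tried: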
cogroupextrainsource= COGROUP sourcenew by ID inner, destnew by ID;
filterextrainsource= FILTER cogroupextrainsource BY ID NOT (cogroupnew)

This throws an error, as expected.
I need help finding the extra records in the source.
Any help would be appreciated.

Thanks!

Best answer

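You don't need the $ sign next to the column name ID. Use $ only when you want to refer to a column by position (for example $0) rather than by name.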

cogroupextrainsource = COGROUP sourcenew by ID inner, destnew by ID;
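
The FILTER step is not shown above. A minimal sketch of one way to get the "NOT IN" result, assuming the goal is the rows of sourcenew whose ID has no match in destnew: keep only the groups whose destnew bag is empty, then flatten the sourcenew bag. The relation names onlyinsource and extrainsource are hypothetical and chosen here for illustration; IsEmpty and FLATTEN are Pig built-ins.

```pig
-- Load both files with the same schema as in the question.
sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') AS (ID:chararray, Name:chararray, FirstName:chararray, LastName:chararray, Vertical_Name:chararray, Vertical_ID:chararray, Gender:chararray, DOB:chararray, Degree_Percentage:chararray, Salary:chararray, StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') AS (ID:chararray, Name:chararray, FirstName:chararray, LastName:chararray, Vertical_Name:chararray, Vertical_ID:chararray, Gender:chararray, DOB:chararray, Degree_Percentage:chararray, Salary:chararray, StateName:chararray);

-- INNER on sourcenew drops groups with no source rows;
-- destnew has no INNER, so IDs missing from the destination keep an empty destnew bag.
cogroupextrainsource = COGROUP sourcenew BY ID INNER, destnew BY ID;

-- Hypothetical name: groups whose ID does not occur in destnew.
onlyinsource = FILTER cogroupextrainsource BY IsEmpty(destnew);

-- Hypothetical name: turn the grouped source bag back into plain tuples.
extrainsource = FOREACH onlyinsource GENERATE FLATTEN(sourcenew);

DUMP extrainsource;
```

Swapping the roles of the two relations (INNER on destnew, IsEmpty(sourcenew), FLATTEN(destnew)) should give the records that exist only in the destination.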

Regarding hadoop - NOT IN functionality in Pig, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41952027/
