json - Pig script/command to filter a file on specific strings

Reposted. Author: 行者123. Updated: 2023-12-02 22:03:56

I am trying to write a Hadoop Pig script that takes two files and filters one of them based on the strings in the other, i.e.

words.txt

google 
facebook
twitter
linkedin

tweets.json
{"created_time": "18:47:31 ", "text": "RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...", "user_id": 450990391, "id": 252479809098223616, "created_date": "Sun Sep 30 2012"}

Script
twitter  = LOAD 'Twitter.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');
filtered = FILTER twitter BY (text MATCHES '.*facebook.*');
extracted = FOREACH filtered GENERATE 'facebook' AS pattern,id, user_id, created_time, created_date, text;
final = GROUP extracted BY pattern;
dump final;

Output
(facebook,{(facebook,252545104890449921,291041644,23:06:59 ,Sun Sep 30 2012,RT @Joey7Barton: ..give a facebook about whether the americans wins a Ryder cup. I mean surely he has slightly more important matters. #fami ...)})

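This is the output I get without loading the words.txt file, i.e. by filtering the tweets directly.

The output I need is
(facebook)(complete tweet of that facebook word contained)

That is, the script should read words.txt and, for each word it reads, fetch all tweets containing that word from the tweets.json file.

Any help would be appreciated.

Mohan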

Best answer

You could look in the direction of running multiple statements inside a FOREACH statement. Something like this:

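-- Untested sketch of the idea: build a regex from each word, then filter the tweets with it.
-- Caveat: a nested FOREACH block normally works only on bags inside the current tuple,
-- so referencing the separate relation `twitter` here may be rejected by Pig.
final = FOREACH words {
    pattern  = CONCAT(CONCAT('.*', $0), '.*');
    filtered = FILTER twitter BY (text MATCHES pattern);
    GENERATE pattern, FLATTEN(filtered);
};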

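Note that this is only an idea and I have not tested it. I will try to test it as soon as I have access to a Pig environment, but this should help you get started.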
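
For comparison, here is a minimal sketch of another way to get one group of tweets per word, using CROSS followed by FILTER instead of a nested FOREACH. This is not the answer author's method and it is also untested. It assumes the file names 'words.txt' and 'tweets.json' from the question, one word per line in words.txt, and a Pig version in which MATCHES accepts a non-constant pattern expression. The aliases `trimmed`, `clean_words`, `pairs`, `matched` and `by_word` are made up for illustration.

```pig
-- Assumed file names and schemas; adjust to your data.
words   = LOAD 'words.txt' AS (word:chararray);
twitter = LOAD 'tweets.json' USING JsonLoader('created_time:chararray, text:chararray, user_id:chararray, id:chararray, created_date:chararray');

-- Trim stray whitespace (the sample line "google " has a trailing space) and drop empty lines.
trimmed     = FOREACH words GENERATE TRIM(word) AS word;
clean_words = FILTER trimmed BY (word IS NOT NULL) AND (SIZE(word) > 0);

-- Pair every word with every tweet, then keep the pairs whose tweet text contains the word.
pairs   = CROSS clean_words, twitter;
matched = FILTER pairs BY twitter::text MATCHES CONCAT('.*', CONCAT(clean_words::word, '.*'));

-- One group per word, each holding the tweets that mention it.
by_word = GROUP matched BY clean_words::word;
DUMP by_word;
```

Two caveats on this sketch: CROSS produces every word/tweet combination before filtering, so it is only reasonable while words.txt stays small, and each word is interpreted as a regular expression, so words containing characters such as `.` or `+` would need escaping.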

Regarding "json - Pig script/command to filter a file on specific strings", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39264956/
