
hadoop - In Pig, how do I count the number of lines that contain a specific string?

Reposted. Author: 可可西里. Updated: 2023-11-01 14:36:38

Suppose I have a set of target words:

a b c d

and an input file:

a d f s g e
12399
c a d i f
a 2

Then the result should be the number of lines that contain each target word:

a 3
b 0
c 1
d 2

How can I do this in Pig? Thanks!

Best answer

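First remove the duplicate words from each line, then run a word count. Because each word then appears at most once per line, its count equals the number of lines that contain it.
Pig steps: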

REGISTER 'udf-1.0-SNAPSHOT.jar';
DEFINE tuple_set com.ts.pig.UniqueRecords();
data = LOAD '<file>' USING PigStorage();

-- remove duplicate words from each line
unique = FOREACH data GENERATE tuple_set($0) AS line;
-- split each de-duplicated line into one word per record
words = FOREACH unique GENERATE FLATTEN(TOKENIZE(line, ' ')) AS word;
grouped = GROUP words BY word;
count = FOREACH grouped GENERATE group, COUNT(words);
DUMP count;

Sample code for the Pig UDF:

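package com.ts.pig;

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

/**
 * This UDF removes duplicate words from a line.
 */
public class UniqueRecords extends EvalFunc<String> {
    @Override
    public String exec(Tuple tuple) throws IOException {
        // Guard against empty tuples and null fields (for example, blank input lines).
        if (tuple == null || tuple.size() == 0 || tuple.get(0) == null) {
            return null;
        }
        String[] splits = tuple.get(0).toString().split(" ");
        // A set keeps a single copy of each word; the order of words is not preserved.
        Set<String> elements = new HashSet<String>(Arrays.asList(splits));
        // Join with single spaces so TOKENIZE(line, ' ') can split the result again.
        return String.join(" ", elements);
    }
}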
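
Note that the Pig script above counts every word in the file, so it does not by itself restrict the output to the target words or report a 0 for a target that never occurs (such as b in the question). As a way to check the expected numbers locally, here is a minimal plain-Java sketch of the same "de-duplicate each line, then count" idea, limited to the target set. The class name TargetLineCounter and the method countLines are hypothetical names chosen for illustration; they are not part of Pig or of the original answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Pig or the original answer.
public class TargetLineCounter {

    // For each target word, counts how many lines contain it at least once.
    // Targets that never occur keep a count of 0.
    static Map<String, Integer> countLines(List<String> targets, List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String target : targets) {
            counts.put(target, 0);
        }
        for (String line : lines) {
            // De-duplicate the words of this line so a repeated word counts once per line.
            Set<String> unique = new HashSet<>(Arrays.asList(line.trim().split("\\s+")));
            for (String word : unique) {
                // Words outside the target set are ignored.
                counts.computeIfPresent(word, (key, value) -> value + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> targets = Arrays.asList("a", "b", "c", "d");
        List<String> lines = Arrays.asList("a d f s g e", "12399", "c a d i f", "a 2");
        // Prints one "word count" pair per target, in the order the targets were given.
        countLines(targets, lines).forEach((word, n) -> System.out.println(word + " " + n));
    }
}
```

With the sample input from the question this prints a 3, b 0, c 1, d 2, one pair per line. In Pig itself, a comparable result would presumably need the target words loaded as a second relation and combined with the counts, for example through a left outer join, with missing counts replaced by 0.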

Regarding "hadoop - In Pig, how do I count the number of lines that contain a specific string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40006751/
