
java - Hadoop inverted index with counts

Reposted. Author: 可可西里. Updated: 2023-11-01 15:37:09

I have two files as input:

fileA.txt:

learn hadoop
learn java

fileB.txt:

hadoop java
eclipse eclipse

Expected output:

learn   fileA.txt:2

hadoop fileA.txt:1 , fileB.txt:1

java fileA.txt:1 , fileB.txt:1

eclipse fileB.txt:2

My reduce method:

public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    Set<Text> outputValues = new HashSet<Text>();
    while (values.hasNext()) {
        Text value = new Text(values.next());
        // delete duplicates
        outputValues.add(value);
    }
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!isfirst) {
            toReturn.append("/");
        }
        isfirst = false;
        toReturn.append(outputIter.next().toString());
    }
    output.collect(key, new Text(toReturn.toString()));
}

I need help with the counter (counting the occurrences of each word per file).

I managed to print:

learn   fileA.txt

hadoop fileA.txt / fileB.txt

java fileA.txt / fileB.txt

eclipse fileB.txt

but I cannot print the count for each file.

Any help would be appreciated.

Best answer

As far as I understand, this should print what you want:

// Needs java.util.HashMap and java.util.Map in addition to the imports of the original reducer.
@Override
public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Count how many times each file name arrives for this word,
    // instead of collapsing duplicates into a set.
    Map<String, Integer> fileToCnt = new HashMap<String, Integer>();
    while (values.hasNext()) {
        String file = values.next().toString();
        Integer current = fileToCnt.get(file);
        if (current == null) {
            current = 0;
        }
        fileToCnt.put(file, current + 1);
    }
    // Join the entries as "file:count, file:count".
    // HashMap iteration order is unspecified, so the file order may vary.
    boolean isfirst = true;
    StringBuilder toReturn = new StringBuilder();
    for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
        if (!isfirst) {
            toReturn.append(", ");
        }
        isfirst = false;
        toReturn.append(entry.getKey()).append(":").append(entry.getValue());
    }
    output.collect(key, new Text(toReturn.toString()));
}
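
The counting above only works if the mapper emits the file name once per word occurrence (not once per distinct word), since the reducer counts how often each file name arrives. The question does not show the mapper, so that is an assumption. As a rough way to check the logic without a cluster, here is a minimal plain-Java sketch that simulates both steps on the sample input. The class name InvertedIndexSketch and its map/reduce helpers are hypothetical and are not Hadoop APIs; the sketch also uses a TreeMap so the file order is deterministic, and the " , " separator from the expected output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone simulation; no Hadoop classes are involved.
class InvertedIndexSketch {

    // Simulated map step (assumed mapper behavior): one (word, fileName)
    // pair per token occurrence, grouped by word as the shuffle would do.
    public static void map(String fileName, String line, Map<String, List<String>> shuffled) {
        for (String word : line.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // skip blank lines
            }
            shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(fileName);
        }
    }

    // Same counting idea as the reduce method above, on plain strings.
    // TreeMap keeps file names sorted, so the output order is stable.
    public static String reduce(Iterator<String> values) {
        Map<String, Integer> fileToCnt = new TreeMap<>();
        while (values.hasNext()) {
            fileToCnt.merge(values.next(), 1, Integer::sum);
        }
        StringBuilder toReturn = new StringBuilder();
        for (Map.Entry<String, Integer> entry : fileToCnt.entrySet()) {
            if (toReturn.length() > 0) {
                toReturn.append(" , ");
            }
            toReturn.append(entry.getKey()).append(':').append(entry.getValue());
        }
        return toReturn.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String line : Arrays.asList("learn hadoop", "learn java")) {
            map("fileA.txt", line, shuffled);
        }
        for (String line : Arrays.asList("hadoop java", "eclipse eclipse")) {
            map("fileB.txt", line, shuffled);
        }
        for (Map.Entry<String, List<String>> entry : shuffled.entrySet()) {
            System.out.println(entry.getKey() + "\t" + reduce(entry.getValue().iterator()));
        }
    }
}
```

Because the words are also kept in a TreeMap here, the lines come out in alphabetical order (eclipse, hadoop, java, learn) rather than in the order listed in the question; the per-file counts are the same.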

Regarding java - Hadoop inverted index with counts, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23411464/
