
Hadoop inverted index without duplicate file names


My current output is:

word   file
-----  ------
wordx  Doc2, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1, Doc1

What I want is:

word   file
-----  ------
wordx  Doc2, Doc1

public static class LineIndexMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(LongWritable key, Text val,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        location.set(fileName);

        String line = val.toString();
        StringTokenizer itr = new StringTokenizer(line.toLowerCase());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            output.collect(word, location);
        }
    }
}

public static class LineIndexReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {

        boolean first = true;
        StringBuilder toReturn = new StringBuilder();
        while (values.hasNext()) {
            if (!first) {
                toReturn.append(", ");
            }
            first = false;
            toReturn.append(values.next().toString());
        }

        output.collect(key, new Text(toReturn.toString()));
    }
}

For best performance, where should I skip the recurring file names: in the map, in the reduce, or both? P.S. I am a beginner at writing MR jobs and am also trying to work out the programming logic through this question.

Best answer

You will only be able to remove the duplicates in the Reducer. To do so, you can use a Set, which does not allow duplicates.

public void reduce(Text key, Iterator<Text> values,
                   OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // Text overrides equals() and hashCode(), so a HashSet can de-duplicate it
    Set<Text> outputValues = new HashSet<Text>();

    while (values.hasNext()) {
        // make a new Text because Hadoop reuses the original object
        Text value = new Text(values.next());

        // the Set takes care of removing duplicates
        outputValues.add(value);
    }

    boolean first = true;
    StringBuilder toReturn = new StringBuilder();
    Iterator<Text> outputIter = outputValues.iterator();
    while (outputIter.hasNext()) {
        if (!first) {
            toReturn.append(", ");
        }
        first = false;
        toReturn.append(outputIter.next().toString());
    }

    output.collect(key, new Text(toReturn.toString()));
}

Edit: per Chris's comment, a copy of the value is added to the Set instead of the value itself.
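The copy matters because Hadoop's old mapred API hands the reducer the same Text instance on every values.next() call and only rewrites its contents. Below is a minimal standalone sketch of that effect; the ReuseDemo class name and the Doc1/Doc2 literals are made up purely for illustration.

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;

public class ReuseDemo {
    public static void main(String[] args) {
        // Stand-in for the single Text instance Hadoop reuses across next() calls.
        Text reused = new Text();

        // Storing the shared reference: the set ends up with stale entries that
        // all point at the same mutated object, so the distinct "Doc1" is lost.
        Set<Text> withoutCopy = new HashSet<Text>();
        reused.set("Doc1");
        withoutCopy.add(reused);
        reused.set("Doc2");
        withoutCopy.add(reused);
        System.out.println(withoutCopy);   // prints something like [Doc2, Doc2]

        // Defensive copies, as in the reducer above: both values survive.
        Set<Text> withCopy = new HashSet<Text>();
        reused.set("Doc1");
        withCopy.add(new Text(reused));
        reused.set("Doc2");
        withCopy.add(new Text(reused));
        System.out.println(withCopy);      // prints Doc1 and Doc2 in some order
    }
}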

Regarding a Hadoop inverted index without duplicate file names, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10305435/
