
java - Hadoop indexing


I am using the code below and am having trouble compiling it. What I am trying to achieve is an index of word usage, so that each word points to the position numbers it occupies in each file. So, say we have "boy" in abc.txt, we would get something like

boy /usr/abc.txt: 1 3

meaning "boy" is the first and third word in the file.

With the code below I get two errors at compile time: one says GenericOptionsParser cannot be found, and the other says filename cannot be found. I tried to adapt the generic WordCount code into this. Can someone point me in the right direction?

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordIndex {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context
                        ) throws IOException, InterruptedException {

            //context.getInputSplit();
            //Path filePath = ((FileSplit) context.getInputSplit()).getPath();
            //String filename = ((FileSplit)context.getInputSplit()).getPath().getName();

            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line);
            //StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {

                String fileName = ((org.apache.hadoop.mapreduce.lib.input.FileSplit) context.getInputSplit()).getPath().getName();
                word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ filename); // get rid of special char
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context
                           ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);

            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(DocWordIndex.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
}

Best answer

I took your code as-is, and after three changes it compiled:

  • In the following statement, change filename to fileName (capital "N" in fileName)

    Change:
    word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ filename);

    to:
    word.set(itr.nextToken().toLowerCase().replaceAll("[^a-z]+","") +" "+ fileName);
  • Import the GenericOptionsParser package:

    Add the following import:
    import org.apache.hadoop.util.GenericOptionsParser;
  • job.setJarByClass() is wrong. It is set to DocWordIndex.class instead of WordIndex.class

    Change:
    job.setJarByClass(DocWordIndex.class);

    to:
    job.setJarByClass(WordIndex.class);

With these changes the code compiled for me; a consolidated sketch of the corrected driver follows after the dependency list below.

My Maven dependencies were (I am using Hadoop 2.7.0):
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.0</version>
</dependency>

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-core</artifactId>
    <version>1.2.1</version>
</dependency>
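
For reference, here is a minimal sketch of the imports and the main() driver with the three changes above applied; everything else is taken verbatim from the question. The only extra tweak, borrowed from the stock WordCount example, is closing the usage-check if block before the job setup, so the job actually runs when two arguments are supplied (the posted code leaves that block open around the whole job setup):

import org.apache.hadoop.util.GenericOptionsParser; // the previously missing import

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    } // close the usage check here, as in the stock WordCount example

    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordIndex.class); // was DocWordIndex.class
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Assuming the class is packaged into a jar (the name wordindex.jar below is just for illustration), the job can then be submitted with something like:

hadoop jar wordindex.jar WordIndex <input path> <output path>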

Regarding java - Hadoop indexing, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34256245/
