java - NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs

Reposted. Author: 可可西里. Updated: 2023-11-01 14:32:50

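I am trying to build an inverted index.

I chain two jobs.

Basically, the first job parses the input, cleans it up, and stores the result in the folder "output", which is the input folder for the second job.

The second job is supposed to actually build the inverted index.

When I run only the first job, it works fine (at least, there are no exceptions).

I chain the two jobs like this: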

public class Main {

    public static void main(String[] args) throws Exception {

        String inputPath = args[0];
        String outputPath = args[1];
        String stopWordsPath = args[2];
        String finalOutputPath = args[3];

        Configuration conf = new Configuration();
        conf.set("job.stopwords.path", stopWordsPath);

        Job job = Job.getInstance(conf, "Tokenize");

        job.setJobName("Tokenize");
        job.setJarByClass(TokenizerMapper.class);

        job.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PostingListEntry.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PostingListEntry.class);

        job.setOutputFormatClass(MapFileOutputFormat.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(TokenizerReducer.class);

        // Delete the output directory if it exists already.
        Path outputDir = new Path(outputPath);
        FileSystem.get(conf).delete(outputDir, true);

        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");

        //-------------------------------------------------------------------------

        Configuration conf2 = new Configuration();

        Job job2 = Job.getInstance(conf2, "BuildIndex");

        job2.setJobName("BuildIndex");
        job2.setJarByClass(InvertedIndexMapper.class);

        job2.setOutputFormatClass(TextOutputFormat.class);

        job2.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job2, new Path(outputPath));
        FileOutputFormat.setOutputPath(job2, new Path(finalOutputPath));

        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(PostingListEntry.class);

        job2.setMapperClass(InvertedIndexMapper.class);
        job2.setReducerClass(InvertedIndexReducer.class);

        // Delete the output directory if it exists already.
        Path finalOutputDir = new Path(finalOutputPath);
        FileSystem.get(conf2).delete(finalOutputDir, true);

        startTime = System.currentTimeMillis();
        // THIS LINE GIVES ERROR:
        job2.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
}

I get the following exception:

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at Main.main(Main.java:79)

What is wrong with this configuration, and how should I chain the jobs?

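Best answer

It is unclear whether you are intentionally using MapFileOutputFormat as the output format of the first job. The more common approach is to use SequenceFileOutputFormat in the first job, with SequenceFileInputFormat as the input format of the second job.

At the moment, you have specified MapFileOutputFormat as the output of the first job and no input format for the second, so the second job defaults to TextInputFormat, which is unlikely to work. MapFileOutputFormat writes each partition as a directory holding a data file and an index file, which TextInputFormat does not expect; that mismatch is the likely source of the NullPointerException during split calculation.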
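The pairing of formats described above could be wired up roughly as follows. This is only a sketch: the class name JobChaining and the helper methods link and runInOrder are hypothetical, and it assumes both Job objects have already been given their mapper, reducer, and key/value classes as in the question. It needs the Hadoop client libraries on the classpath.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper, not part of the original question or answer.
public final class JobChaining {

    private JobChaining() {}

    // Makes the second job read exactly what the first job writes.
    public static void link(Job first, Job second, Path intermediate) throws IOException {
        // First job: write binary key/value pairs as SequenceFiles.
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(first, intermediate);

        // Second job: state the input format explicitly instead of
        // falling back to the default TextInputFormat.
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(second, intermediate);
    }

    // Runs the second job only if the first one succeeded.
    public static boolean runInOrder(Job first, Job second)
            throws IOException, InterruptedException, ClassNotFoundException {
        return first.waitForCompletion(true) && second.waitForCompletion(true);
    }
}

With this in place, the driver would call JobChaining.link(job, job2, new Path(outputPath)) before submitting, and JobChaining.runInOrder(job, job2) in place of the two separate waitForCompletion calls. Checking the result of the first job avoids starting the second one on missing or partial input.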

Looking at your TokenizerReducer class, the signature of the reduce method is incorrect. You have:

public void reduce(Text key, Iterator<PostingListEntry> values, Context context)

It should be:

public void reduce(Text key, Iterable<PostingListEntry> values, Context context)

Because of this, the framework never calls your implementation, so the job just runs the default identity reduce.
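The effect can be reproduced without Hadoop. The sketch below uses a hypothetical BaseReducer as a stand-in for Hadoop's Reducer (it is not the real class): a method taking Iterator is merely an overload, so the framework-style call still reaches the base method, whereas a method taking Iterable overrides it. Annotating reduce with @Override lets the compiler reject the wrong signature.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverrideDemo {

    // Hypothetical stand-in for Hadoop's Reducer: the default reduce is an identity pass-through.
    public static class BaseReducer {
        public final List<String> out = new ArrayList<>();

        public void reduce(String key, Iterable<String> values) {
            for (String v : values) {
                out.add(key + "=" + v);
            }
        }

        // Plays the role of the framework, which always calls the Iterable version.
        public void run(String key, List<String> values) {
            reduce(key, values);
        }
    }

    // Wrong: the Iterator parameter makes this an overload, so run() never reaches it.
    // Adding @Override here would be a compile error, which is the point of using it.
    public static class WrongReducer extends BaseReducer {
        public void reduce(String key, Iterator<String> values) {
            out.add(key + "=custom");
        }
    }

    // Right: same parameter types as the base method, verified by @Override.
    public static class RightReducer extends BaseReducer {
        @Override
        public void reduce(String key, Iterable<String> values) {
            out.add(key + "=custom");
        }
    }

    // Feeds one key with two values through the given reducer and returns what it emitted.
    public static List<String> demo(BaseReducer reducer) {
        reducer.run("k", Arrays.asList("a", "b"));
        return reducer.out;
    }
}

Here demo(new WrongReducer()) returns [k=a, k=b], the untouched identity output, while demo(new RightReducer()) returns [k=custom].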

Regarding "java - NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39823494/
