java - NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs


Reposted by 可可西里; updated 2023-11-01 14:32:50


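I am trying to build an inverted index.

I chain two jobs.

Basically, the first job parses the input and cleans it, then stores the result in a folder "output", which is the input folder for the second job.

The second job is supposed to actually build the inverted index.

When I run only the first job, it works fine (at least, there are no exceptions).

I chain the two jobs like this: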

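import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MapFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Main {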

    public static void main(String[] args) throws Exception {

        String inputPath = args[0];
        String outputPath = args[1];
        String stopWordsPath = args[2];
        String finalOutputPath = args[3];

        Configuration conf = new Configuration();    
        conf.set("job.stopwords.path", stopWordsPath);

        Job job = Job.getInstance(conf, "Tokenize");

        job.setJobName("Tokenize");
        job.setJarByClass(TokenizerMapper.class);

        job.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job, new Path(inputPath));
        FileOutputFormat.setOutputPath(job, new Path(outputPath));

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(PostingListEntry.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PostingListEntry.class);

        job.setOutputFormatClass(MapFileOutputFormat.class);

        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(TokenizerReducer.class);

        // Delete the output directory if it exists already.
        Path outputDir = new Path(outputPath);
        FileSystem.get(conf).delete(outputDir, true);

        long startTime = System.currentTimeMillis();
        job.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");

        //-------------------------------------------------------------------------

        Configuration conf2 = new Configuration();    

        Job job2 = Job.getInstance(conf2, "BuildIndex");

        job2.setJobName("BuildIndex");
        job2.setJarByClass(InvertedIndexMapper.class);

        job2.setOutputFormatClass(TextOutputFormat.class);

        job2.setNumReduceTasks(1);

        FileInputFormat.setInputPaths(job2, new Path(outputPath));
        FileOutputFormat.setOutputPath(job2, new Path(finalOutputPath));

        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(PostingListEntry.class);

        job2.setMapperClass(InvertedIndexMapper.class);
        job2.setReducerClass(InvertedIndexReducer.class);

        // Delete the output directory if it exists already.
        Path finalOutputDir = new Path(finalOutputPath);
        FileSystem.get(conf2).delete(finalOutputDir, true);

        startTime = System.currentTimeMillis();
        // THIS LINE GIVES ERROR: 
        job2.waitForCompletion(true);
        System.out.println("Job Finished in " + (System.currentTimeMillis() - startTime) / 1000.0 + " seconds");
    }
}

I get a

Exception in thread "main" java.lang.NullPointerException
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex(FileInputFormat.java:444)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:413)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
    at Main.main(Main.java:79)

What is wrong with this configuration, and how should I chain the jobs?

Best answer

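It isn't clear whether you are intentionally using MapFileOutputFormat as the first job's output format. The more common approach is to use SequenceFileOutputFormat in the first job, with SequenceFileInputFormat as the input format of the second job.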

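Currently, the first job writes its output with MapFileOutputFormat, while the second job specifies no input format, so it defaults to TextInputFormat, which is unlikely to work. MapFileOutputFormat writes each reducer's output as a directory holding binary data and index files rather than as a flat text file, which is likely what trips up split computation in getSplits.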
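The pairing described above can be sketched as follows. This is only an illustration, not the asker's actual driver: the class JobChaining and the method runChained are hypothetical names, the Hadoop 2.x org.apache.hadoop.mapreduce API is assumed to be on the classpath, and PostingListEntry is assumed to implement Writable so it can be stored in a SequenceFile. Mapper, reducer, and key/value classes are expected to be set on both jobs before the call.

```java
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Hypothetical helper class, not part of Hadoop or of the original question.
public final class JobChaining {
    private JobChaining() {}

    /**
     * Connects two already-configured jobs through an intermediate directory
     * of SequenceFiles and runs them one after the other.
     * Returns true only if both jobs succeed.
     */
    static boolean runChained(Job first, Job second, Path intermediate)
            throws IOException, InterruptedException, ClassNotFoundException {
        // Job 1 writes binary key/value pairs instead of a MapFile.
        first.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setOutputPath(first, intermediate);

        // Job 2 reads them back with the matching input format
        // (without this line it would default to TextInputFormat).
        second.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.setInputPaths(second, intermediate);

        // Start the second job only if the first one completed successfully.
        return first.waitForCompletion(true) && second.waitForCompletion(true);
    }
}
```

With this wiring, the second job's mapper would need to declare Text and PostingListEntry as its input key and value types, since those are the types the first job writes. A driver could call it as JobChaining.runChained(job, job2, new Path(outputPath)) and exit with a non-zero status when it returns false.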

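Looking at your TokenizerReducer class, the signature of the reduce method is incorrect. You have:

public void reduce(Text key, Iterator<PostingListEntry> values, Context context)

It should be:

public void reduce(Text key, Iterable<PostingListEntry> values, Context context)

Because of this mismatch, the framework never calls your implementation; it falls back to the default reduce, which is an identity reduce.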
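The effect can be reproduced without Hadoop. The sketch below uses a simplified stand-in for the framework's base class (BaseReducer, WrongReducer, RightReducer, and runWith are all hypothetical names, and String/Integer replace the Hadoop types) to show that a method taking Iterator is an overload rather than an override, so the base class's pass-through version runs instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverrideDemo {
    /** Simplified stand-in for a framework base class with a default identity reduce. */
    static class BaseReducer {
        final List<String> out = new ArrayList<>();

        // Default behaviour: emit every value unchanged.
        protected void reduce(String key, Iterable<Integer> values) {
            for (Integer v : values) {
                out.add(key + "=" + v);
            }
        }

        // The "framework" always calls the Iterable version.
        void run(String key, List<Integer> values) {
            reduce(key, values);
        }
    }

    static class WrongReducer extends BaseReducer {
        // Iterator instead of Iterable: this is a new overload, never called by run().
        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            out.add(key + "=" + sum);
        }
    }

    static class RightReducer extends BaseReducer {
        // @Override makes the compiler reject a signature that does not match.
        @Override
        protected void reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            out.add(key + "=" + sum);
        }
    }

    static List<String> runWith(BaseReducer reducer) {
        reducer.run("a", Arrays.asList(1, 2, 3));
        return reducer.out;
    }

    public static void main(String[] args) {
        System.out.println(runWith(new WrongReducer())); // [a=1, a=2, a=3]
        System.out.println(runWith(new RightReducer())); // [a=6]
    }
}
```

Adding @Override to the reduce method in TokenizerReducer would turn this kind of silent mismatch into a compile-time error.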

Regarding java - NullPointerException in hadoop.mapreduce.lib.input.FileInputFormat.getBlockIndex when chaining two jobs, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39823494/
