Hadoop - Join with MultipleInputs probably skips the Reducer


Reposted by: 行者123. Last updated: 2023-12-02 21:57:54


So, I want to perform a reduce-side join with MapReduce. (No Hive or anything; I'm experimenting with vanilla Hadoop at the moment.)

I have 2 input files. The first looks like this:
12 13
12 15
12 16
12 23

The second contains just 12 1000.

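So I assign each file to a separate mapper, which tags every key-value pair with 0 or 1 depending on its source file. That part works. How do I know?
I get the map output I expect:

| key  | value |
| 12 0 | 1000  |
| 12 1 | 13    |
| 12 1 | 15    |
| 12 1 | 16    | etc.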

My Partitioner partitions on the first part of the key (i.e. 12).
The Reducer should then join by key. However, the job seems to skip the reduce step.
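
The intended join can be sketched in plain Java without Hadoop, assuming the records are sorted by (key, tag) and grouped on the first part of the key, so the tag-0 degree record reaches the reducer before the tag-1 edge records. The class and method names below are hypothetical, and the in-memory sort only stands in for the shuffle:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class ReduceSideJoinSketch {

    // Each record is {key, tag, value}; tag "0" = degree, tag "1" = edge.
    static List<String> join(List<String[]> mapOutput) {
        List<String[]> sorted = new ArrayList<>(mapOutput);
        // Stand-in for the shuffle: order by key, then by tag.
        sorted.sort(Comparator.comparing((String[] r) -> r[0])
                .thenComparing(r -> r[1]));

        List<String> joined = new ArrayList<>();
        String currentKey = null;
        String degree = null;
        for (String[] r : sorted) {
            if (!r[0].equals(currentKey)) {
                // First record of a group: assumed to be the tag-0 degree.
                currentKey = r[0];
                degree = r[2];
                continue;
            }
            // Every later record in the group is an edge to join with the degree.
            joined.add(currentKey + "\t" + degree + "\t" + r[2]);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"12", "1", "13"},
                new String[] {"12", "0", "1000"},
                new String[] {"12", "1", "15"},
                new String[] {"12", "1", "16"});
        join(mapOutput).forEach(System.out::println);
    }
}
```

If a key had no tag-0 record, this sketch would treat the first edge as the degree, so a real job would probably want to check the tag explicitly.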

I wonder whether something is wrong with my driver?

My code (Hadoop v0.22, but the results are the same with 0.20.2 plus extra libraries from trunk):

Mapper

public static class JoinDegreeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    public void map(Text node, Text degree, Context context)
            throws IOException, InterruptedException {

        context.write(new TextPair(node.toString(), "0"), degree);

    }
}

public static class JoinEdgeListMapper extends
        Mapper<Text, Text, TextPair, Text> {
    public void map(Text firstNode, Text secondNode, Context context)
            throws IOException, InterruptedException {

        context.write(new TextPair(firstNode.toString(), "1"), secondNode);

    }
}

Reducer

public static class JoinOnFirstReducer extends
        Reducer<TextPair, Text, Text, Text> {
    public void reduce(TextPair key, Iterator<Text> values, Context context)
            throws IOException, InterruptedException {

        context.progress();
        Text nodeDegree = new Text(values.next());
        while (values.hasNext()) {
            Text secondNode = values.next();
            Text outValue = new Text(nodeDegree.toString() + "\t"
                    + secondNode.toString());
            context.write(key.getFirst(), outValue);
        }
    }
}

Partitioner

public static class JoinOnFirstPartitioner extends
        Partitioner<TextPair, Text> {

    @Override
    public int getPartition(TextPair key, Text Value, int numOfPartitions) {
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numOfPartitions;
    }
}

Driver

public int run(String[] args) throws Exception {


    Path edgeListPath = new Path(args[0]);
    Path nodeListPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    Configuration conf = getConf();

    Job job = new Job(conf);
    job.setJarByClass(JoinOnFirstNode.class);
    job.setJobName("Tag first node with degree");

    job.setPartitionerClass(JoinOnFirstPartitioner.class);
    job.setGroupingComparatorClass(TextPair.FirstComparator.class);
    //job.setSortComparatorClass(TextPair.FirstComparator.class);
    job.setReducerClass(JoinOnFirstReducer.class);

    job.setMapOutputKeyClass(TextPair.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);


    MultipleInputs.addInputPath(job, edgeListPath, EdgeInputFormat.class,
            JoinEdgeListMapper.class);
    MultipleInputs.addInputPath(job, nodeListPath, EdgeInputFormat.class,
            JoinDegreeListMapper.class);

    FileOutputFormat.setOutputPath(job, outputPath);


    return job.waitForCompletion(true) ? 0 : 1;

}

Best answer

My reduce function took an Iterator<> instead of an Iterable, so the job fell through to the identity reducer.
I can't believe I overlooked that. Rookie mistake.

The answer came from this Q/A:
Using Hadoop for the First Time, MapReduce Job does not run Reduce Phase
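
Why this happens: the framework invokes `reduce(KEYIN, Iterable<VALUEIN>, Context)`, so a method declared with `Iterator` is an overload rather than an override, and the default pass-through implementation runs instead. A minimal sketch of the effect in plain Java, with no Hadoop dependency; `BaseReducer` and its `run` method are hypothetical stand-ins for the framework, not Hadoop's actual classes:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class OverloadVsOverride {

    // Hypothetical stand-in for Hadoop's Reducer: the "framework" only ever
    // calls reduce(key, Iterable), whose default passes values through.
    static class BaseReducer {
        final List<String> out = new ArrayList<>();

        void reduce(String key, Iterable<String> values) {
            for (String v : values) {
                out.add(key + "\t" + v); // identity behaviour
            }
        }

        // Mimics the framework: always dispatches to the Iterable version.
        List<String> run(String key, List<String> values) {
            reduce(key, values);
            return out;
        }
    }

    // Bug: Iterator instead of Iterable. This is an overload, so run()
    // never calls it and the identity reduce executes instead.
    static class BrokenJoinReducer extends BaseReducer {
        void reduce(String key, Iterator<String> values) {
            String degree = values.next();
            while (values.hasNext()) {
                out.add(key + "\t" + degree + "\t" + values.next());
            }
        }
    }

    // Fix: Iterable plus @Override, so the compiler verifies the signature.
    static class FixedJoinReducer extends BaseReducer {
        @Override
        void reduce(String key, Iterable<String> values) {
            Iterator<String> it = values.iterator();
            String degree = it.next();
            while (it.hasNext()) {
                out.add(key + "\t" + degree + "\t" + it.next());
            }
        }
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("1000", "13", "15");
        System.out.println(new BrokenJoinReducer().run("12", values));
        System.out.println(new FixedJoinReducer().run("12", values));
    }
}
```

In the real job, the equivalent change would be declaring `reduce(TextPair key, Iterable<Text> values, Context context)` and annotating it with `@Override`, so that a mismatched signature fails at compile time instead of silently producing identity output.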

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/9035244/
