
java - Why doesn't this Hadoop example that uses the Combiner class work properly? (The "local reduction" provided by the Combiner is not performed)


I am new to Hadoop and I am running some experiments, trying to use the Combiner class to perform the reduce operation locally on the same node as the mapper. I am using Hadoop 1.2.1.

So I have these 3 classes:

WordCountWithCombiner.java:

// Learning MapReduce by Nitesh Jain
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;

/*
 * Extends the Configured class and implements the Tool interface, so that
 * ToolRunner / GenericOptionsParser can inject command-line configuration options.
 */
public class WordCountWithCombiner extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        Job job = new Job(conf, "MyJob"); // Job is a "dashboard" with levers to control the execution of the job

        job.setJarByClass(WordCountWithCombiner.class); // Driver class used to locate the jar
        job.setJobName("Word Count With Combiners"); // Set the name of the job

        FileInputFormat.addInputPath(job, new Path(args[0])); // The input path is the first argument passed to main()
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // The output path is the second argument passed to main()

        job.setMapperClass(WordCountMapper.class); // Set the mapper class

        /* Set the combiner: the combiner is a reducer run locally on the mapper's node (we are reusing the
         * WordCountReducer class because it performs the same task, only local to the mapper):
         */
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class); // Set the reducer class

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        /* ToolRunner triggers the run() method, which contains all the job-submission logic.
         * It also gives us the ability to set configuration properties on the command line
         * (for example with -D), so we do not have to write any code to handle them.
         */
        int exitCode = ToolRunner.run(new Configuration(), new WordCountWithCombiner(), args);
        System.exit(exitCode);
    }

}

WordCountMapper.java:

// Learning MapReduce by Nitesh J.
// Word Count Mapper.
import java.io.IOException;
import java.util.StringTokenizer;

// Import the key and value datatypes:
import org.apache.hadoop.io.IntWritable; // Similar to Integer
import org.apache.hadoop.io.LongWritable; // Similar to Long
import org.apache.hadoop.io.Text; // Similar to String

import org.apache.hadoop.mapreduce.Mapper;

/* Every mapper class extends the Hadoop Mapper class.
 * Type parameters:
 *   input key    (the byte offset of the line in the file)
 *   input value  (the line of text, so something like a String)
 *   output key   (the word)
 *   output value (the count 1)
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /* Override the map() function defined by the extended Mapper class.
     * The parameter types have to match those declared in the Mapper type parameters above.
     * @param context: used to write the output <key, value> pairs.
     *
     * Tokenize the line into words and write each word into the context,
     * with the word as key and one (1) as value.
     */
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line);

        while (itr.hasMoreTokens()) {
            // Convert everything to lower case
            word.set(itr.nextToken().toLowerCase());
            // Only emit the token if it starts with an alphabetic character
            if (Character.isAlphabetic(word.toString().charAt(0))) {
                context.write(word, one);
            }
        }
    }

}
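
For reference, given the single input line used later in this question ("to be or not to be"), this mapper emits six (word, 1) pairs, one per token, all lower-cased, which matches the "Map output records=6" counter in the job output further down:

(to, 1)
(be, 1)
(or, 1)
(not, 1)
(to, 1)
(be, 1)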

WordCountReducer.java:

// Learning MapReduce by Nitesh Jain
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/* Every reducer class has to extend the Hadoop Reducer class.
 * Type parameters (they have to match the Mapper's output types):
 *   input key    (Text, the word)
 *   input value  (the occurrence count emitted by the mapper: 1)
 *   output key   (the word)
 *   output value (the total number of occurrences of that word)
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    /*
     * Override the reduce() function defined by the extended Reducer class.
     * @param key: the current word
     * @param Iterable<IntWritable> values: the input of reduce() is a key and the list of values associated with that key
     * @param context: collects the output <key, value> pairs
     */
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

}
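
As a concrete example of what reduce() computes here: for the key "to", the framework passes the value list [1, 1] (one entry per occurrence in the sample input used below), the loop sums them to 2, and the reducer writes (to, 2), which is exactly the line that appears in the out6 listing further down.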

As you can see in the WordCountWithCombiner driver class, I have set the WordCountReducer class as the combiner, so that the reduction is performed directly on the mapper node, via the following line:

job.setCombinerClass(WordCountReducer.class);

Then I have this input file on the Hadoop file system:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat  in
to be or not to be

This is the input I want to process.

If I run the previous job in the classic way, going through both phases of MapReduce, it works fine; in fact, when I execute this statement in the Linux shell:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop jar WordCount.jar WordCountWithCombiner in out6

Hadoop runs it and I get the expected result:

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat  out6/p*
be 2
not 1
or 1
to 2
andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$

OK, it works fine.

The problem is that now I do not want to run the reduce phase, yet I expect the same result, since I have set a combiner that performs the same operation on the mapper's own node.

So, in the Linux shell, I run this statement, which excludes the reducer phase:

hadoop jar WordCountWithCombiner.jar WordCountWithCombiner -D mapred.reduce.tasks=0 in out7
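
As a side note, the -D option only reaches the job because the driver extends Configured and implements Tool, so ToolRunner's GenericOptionsParser copies it into the Configuration returned by getConf(). The programmatic equivalent (just a sketch; this line is not in the original run() method) would be a single extra line before submission:

job.setNumReduceTasks(0); // make the job map-only; the map output is then written directly by the output format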

But it does not work properly; this is what I get (I am posting the whole output to give more information about what is happening):

andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop jar WordCountWithCombiner.jar WordCountWithCombiner -D mapred.reduce.tasks=0 in out7
16/02/13 19:43:44 INFO input.FileInputFormat: Total input paths to process : 1
16/02/13 19:43:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
16/02/13 19:43:44 WARN snappy.LoadSnappy: Snappy native library not loaded
16/02/13 19:43:45 INFO mapred.JobClient: Running job: job_201601242121_0008
16/02/13 19:43:46 INFO mapred.JobClient: map 0% reduce 0%
16/02/13 19:44:00 INFO mapred.JobClient: map 100% reduce 0%
16/02/13 19:44:05 INFO mapred.JobClient: Job complete: job_201601242121_0008
16/02/13 19:44:05 INFO mapred.JobClient: Counters: 19
16/02/13 19:44:05 INFO mapred.JobClient: Job Counters
16/02/13 19:44:05 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=18645
16/02/13 19:44:05 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
16/02/13 19:44:05 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
16/02/13 19:44:05 INFO mapred.JobClient: Launched map tasks=1
16/02/13 19:44:05 INFO mapred.JobClient: Data-local map tasks=1
16/02/13 19:44:05 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
16/02/13 19:44:05 INFO mapred.JobClient: File Output Format Counters
16/02/13 19:44:05 INFO mapred.JobClient: Bytes Written=31
16/02/13 19:44:05 INFO mapred.JobClient: FileSystemCounters
16/02/13 19:44:05 INFO mapred.JobClient: HDFS_BYTES_READ=120
16/02/13 19:44:05 INFO mapred.JobClient: FILE_BYTES_WRITTEN=55503
16/02/13 19:44:05 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=31
16/02/13 19:44:05 INFO mapred.JobClient: File Input Format Counters
16/02/13 19:44:05 INFO mapred.JobClient: Bytes Read=19
16/02/13 19:44:05 INFO mapred.JobClient: Map-Reduce Framework
16/02/13 19:44:05 INFO mapred.JobClient: Map input records=1
16/02/13 19:44:05 INFO mapred.JobClient: Physical memory (bytes) snapshot=93282304
16/02/13 19:44:05 INFO mapred.JobClient: Spilled Records=0
16/02/13 19:44:05 INFO mapred.JobClient: CPU time spent (ms)=2870
16/02/13 19:44:05 INFO mapred.JobClient: Total committed heap usage (bytes)=58195968
16/02/13 19:44:05 INFO mapred.JobClient: Virtual memory (bytes) snapshot=682741760
16/02/13 19:44:05 INFO mapred.JobClient: Map output records=6
16/02/13 19:44:05 INFO mapred.JobClient: SPLIT_RAW_BYTES=101
andrea@andrea-virtual-machine:~/workspace/HadoopExperiment/bin$ hadoop fs -cat out7/p*
to 1
be 1
or 1
not 1
to 1
be 1

As you can see, the local reduction that the Combiner is supposed to provide does not seem to happen.

Why? What am I missing? How can I try to solve this?

Best Answer

Do not assume that the combiner will run. Treat the combiner only as an optimization. The combiner is not guaranteed to run over all of your data. In some cases, when the data does not need to be spilled to disk, MapReduce skips using the combiner entirely. Note also that the combiner may run multiple times over subsets of the data: it runs once per spill.

So setting the number of reduce tasks to 0 does not mean you will get the combined result, because not all of the mapper output is passed through the combiner. In the log above, "Spilled Records=0" and there are no combine counters at all: with no reduce phase there is no sort/spill step, so the map output is written straight to the output files without the combiner ever being invoked.
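
If you need the local aggregation to happen regardless of whether the framework decides to run the combiner, one common alternative (not part of the original question, just a sketch using the same tokenizing logic) is "in-mapper combining": accumulate partial counts in memory inside the mapper and emit them once in cleanup().

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical variant of WordCountMapper that aggregates counts in memory and
// emits them once per map task, so local aggregation happens even in a map-only
// job where the combiner is never invoked.
public class InMapperCombiningWordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final Map<String, Integer> counts = new HashMap<String, Integer>();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken().toLowerCase();
            if (Character.isAlphabetic(word.charAt(0))) {
                Integer current = counts.get(word);
                counts.put(word, current == null ? 1 : current + 1);
            }
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the locally aggregated counts once, at the end of the map task.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}

With this mapper, running the same map-only command would produce already-aggregated pairs per input split, but the per-split partial counts still need a reduce phase (or a second pass) to be merged across splits.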

Regarding "java - Why doesn't this Hadoop example that uses the Combiner class work properly? (The 'local reduction' provided by the Combiner is not performed)", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35383922/
