Difficulties with a composite key in Hadoop

Reposted by: 可可西里 · Updated: 2023-11-01 15:39:42

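I am using Hadoop to analyze GSOD data (ftp://ftp.ncdc.noaa.gov/pub/data/gsod/). I chose five years (2005-2009) for my experiments. I set up a small cluster and ran a simple MapReduce program that finds the maximum temperature recorded in a year.

Now I have to write a new MR program that counts, for each station, every phenomenon that occurred over those years.

The files I have to analyze have the following structure: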

STN--- ...  FRSHTO
722115      110001
722115      011001
722110      111000
722110      001000
722000      001000

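The STN column holds the station code, and the six-character flag column (FRSHTT in the GSOD files, shown as FRSHTO above) holds the phenomena: F - fog, R - rain or drizzle, S - snow or ice pellets, H - hail, T - thunder, O - tornado or funnel cloud.

A value of 1 means the phenomenon occurred that day; 0 means it did not.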
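To make the flag encoding concrete, here is a minimal sketch in plain Java (no Hadoop) that turns one six-character flag string into the letters of the phenomena that occurred. The class name `PhenomenaFlags` and the method `decode` are hypothetical names chosen for illustration, and the letter order F, R, S, H, T, O is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

public class PhenomenaFlags {
    // Letter for each flag position, in the order the question lists them (assumption).
    private static final String LETTERS = "FRSHTO";

    /** Returns the letters whose flag is '1', in column order. */
    public static List<String> decode(String flags) {
        if (flags.length() != LETTERS.length()) {
            throw new IllegalArgumentException("expected 6 flags, got: " + flags);
        }
        List<String> present = new ArrayList<>();
        for (int i = 0; i < flags.length(); i++) {
            if (flags.charAt(i) == '1') {
                present.add(String.valueOf(LETTERS.charAt(i)));
            }
        }
        return present;
    }

    public static void main(String[] args) {
        System.out.println(decode("110001")); // prints [F, R, O]
        System.out.println(decode("001000")); // prints [S]
    }
}
```

Each decoded letter corresponds to one (station, phenomenon) pair emitted with a count of 1.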

I need to produce a result like this:

722115: F = 1, R = 2, S = 1, O = 2
722110: F = 1, R = 1, S = 2
722000: S = 1

The MR program runs, but the result is wrong. It gives me this:

722115 F, 1
722115 R, 1
722115 R, 1
722115 S, 1
722115 O, 1
722115 O, 1
722110 F, 1
722110 R, 1
722110 S, 1
722110 S, 1
722000 S, 1

I used the following code:

Mapper.java

public class Mapper extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, StationPhenomenun, IntWritable> {
@Override
protected void map(LongWritable key, Text value, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException {
    String line = value.toString();
    // Every file starts with a field description line, so, I ignore this line
    if (!line.startsWith("STN---")) {
        // First field of the line means the station code where data was collected
        String station = line.substring(0, 6);
        String fog = (line.substring(132, 133));
        String rainOrDrizzle = (line.substring(133, 134));
        String snowOrIcePellets = (line.substring(134, 135));
        String hail = (line.substring(135, 136));
        String thunder = (line.substring(136, 137));
        String tornadoOrFunnelCloud = (line.substring(137, 138));

        if (fog.equals("1"))
            context.write(new StationPhenomenun(station,"F"), new IntWritable(1));
        if (rainOrDrizzle.equals("1"))
            context.write(new StationPhenomenun(station,"R"), new IntWritable(1));
        if (snowOrIcePellets.equals("1"))
            context.write(new StationPhenomenun(station,"S"), new IntWritable(1));
        if (hail.equals("1"))
            context.write(new StationPhenomenun(station,"H"), new IntWritable(1));
        if (thunder.equals("1"))
            context.write(new StationPhenomenun(station,"T"), new IntWritable(1));
        if (tornadoOrFunnelCloud.equals("1"))
            context.write(new StationPhenomenun(station,"O"), new IntWritable(1));
    }
}
}

Reducer.java

public class Reducer extends org.apache.hadoop.mapreduce.Reducer<StationPhenomenun, IntWritable, StationPhenomenun, IntWritable> {

protected void reduce(StationPhenomenun key, Iterable<IntWritable> values, org.apache.hadoop.mapreduce.Reducer.Context context) throws IOException, InterruptedException {
    int count = 0;
    for (IntWritable value : values) {
        count++;
    }

    String station = key.getStation().toString();
    String occurence = key.getPhenomenun().toString();

    StationPhenomenun textPair = new StationPhenomenun(station, occurence);
    context.write(textPair, new IntWritable(count));
}
}

StationPhenomenun.java

public class StationPhenomenun implements WritableComparable<StationPhenomenun> {
private String station;
private String phenomenun;
public StationPhenomenun(String station, String phenomenun) {
    this.station = station;
    this.phenomenun = phenomenun;
}
public StationPhenomenun() {
}
public String getStation() {
    return station;
}
public String getPhenomenun() {
    return phenomenun;
}
@Override
public void readFields(DataInput in) throws IOException {
    station = in.readUTF();
    phenomenun = in.readUTF();
}
@Override
public void write(DataOutput out) throws IOException {
    out.writeUTF(station);
    out.writeUTF(phenomenun);
}
@Override
public int compareTo(StationPhenomenun t) {
    int cmp = this.station.compareTo(t.station);
    if (cmp != 0) {
        return cmp;
    }
    return this.phenomenun.compareTo(t.phenomenun);
}    
@Override
public boolean equals(Object obj) {
    if (obj == null) {
        return false;
    }
    if (getClass() != obj.getClass()) {
        return false;
    }
    final StationPhenomenun other = (StationPhenomenun) obj;
    if (this.station != other.station && (this.station == null || !this.station.equals(other.station))) {
        return false;
    }
    if (this.phenomenun != other.phenomenun && (this.phenomenun == null || !this.phenomenun.equals(other.phenomenun))) {
        return false;
    }
    return true;
}
@Override
public int hashCode() {
    return this.station.hashCode() * 163 + this.phenomenun.hashCode();
}
}

NcdcJob.java

public class NcdcJob {
public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf);
    job.setJarByClass(NcdcJob.class);
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/station"));
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    job.setMapOutputKeyClass(StationPhenomenun.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(StationPhenomenun.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

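Has anyone done something similar?

P.S.: I tried this solution (Hadoop - composite key), but it did not work for me.

Best answer

Just check that the following two classes match your custom implementations.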

 job.setMapperClass(Mapper.class);
 job.setReducerClass(Reducer.class);

I was able to get the desired result with the following changes:

protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

protected void reduce(StationPhenomenun key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

I also renamed the classes to MyMapper and MyReducer.
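One plausible reading of why the signature change matters: under Java's overriding rules, a `reduce` that combines a raw `Reducer.Context` with a parameterized `Iterable<IntWritable>` is treated as an overload rather than an override, so the framework keeps calling the default `reduce`, which writes every value through unchanged. The sketch below reproduces that effect without Hadoop. `BaseReducer`, `RawContextReducer`, and `TypedContextReducer` are hypothetical stand-ins written for this illustration, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverrideDemo {
    /** Stand-in for a framework base class with a generic inner Context (assumption). */
    public static class BaseReducer<K, V> {
        public class Context {
            final List<String> out = new ArrayList<>();
            public void write(Object key, Object value) { out.add(key + "\t" + value); }
        }

        // Default behaviour: pass every value through unchanged.
        protected void reduce(K key, Iterable<V> values, Context context) {
            for (V value : values) context.write(key, value);
        }

        // The "framework" always calls reduce through this typed signature.
        public List<String> run(K key, Iterable<V> values) {
            Context context = new Context();
            reduce(key, values, context);
            return context.out;
        }
    }

    /** Raw Context plus parameterized Iterable: an overload, never called by run(). */
    public static class RawContextReducer extends BaseReducer<String, Integer> {
        @SuppressWarnings("rawtypes")
        protected void reduce(String key, Iterable<Integer> values, BaseReducer.Context context) {
            int count = 0;
            for (Integer ignored : values) count++;
            context.write(key, count);
        }
    }

    /** Inherited, parameterized Context: a real override, checked by @Override. */
    public static class TypedContextReducer extends BaseReducer<String, Integer> {
        @Override
        protected void reduce(String key, Iterable<Integer> values, Context context) {
            int count = 0;
            for (Integer ignored : values) count++;
            context.write(key, count);
        }
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1);
        System.out.println(new RawContextReducer().run("722110 S", ones));   // two pass-through lines
        System.out.println(new TypedContextReducer().run("722110 S", ones)); // one summed line
    }
}
```

Putting `@Override` on `reduce`, as the question's code already does on `map`, would turn this silent overload into a compile-time error.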

722115,1,1,0,0,0,1
722115,0,1,1,0,0,1
722110,1,1,1,0,0,0
722110,0,0,1,0,0,0
722000,0,0,1,0,0,0

For the input set above, I got the following result:

StationPhenomenun [station=722000, phenomenun=S]    1
StationPhenomenun [station=722110, phenomenun=F]    1
StationPhenomenun [station=722110, phenomenun=R]    1
StationPhenomenun [station=722110, phenomenun=S]    2
StationPhenomenun [station=722115, phenomenun=F]    1
StationPhenomenun [station=722115, phenomenun=O]    2
StationPhenomenun [station=722115, phenomenun=R]    2
StationPhenomenun [station=722115, phenomenun=S]    1

The counts are the same; you only need to customize how the output is displayed.
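As one way to get from the per-key counts to the one-line-per-station layout the question asks for, here is a minimal post-processing sketch in plain Java. It assumes the job output has already been read into rows of station, phenomenon, and count. The class `StationSummary`, the method `summarize`, and the row layout are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StationSummary {
    // Display order of the phenomena, following the question's F, R, S, H, T, O listing.
    private static final String ORDER = "FRSHTO";

    /** rows: {station, phenomenon, count}, one entry per composite key. */
    public static List<String> summarize(String[][] rows) {
        Map<String, int[]> perStation = new TreeMap<>(); // sorted by station code
        for (String[] row : rows) {
            int slot = row[1].length() == 1 ? ORDER.indexOf(row[1]) : -1;
            if (slot < 0) {
                throw new IllegalArgumentException("unknown phenomenon: " + row[1]);
            }
            perStation.computeIfAbsent(row[0], s -> new int[ORDER.length()])[slot]
                    += Integer.parseInt(row[2]);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : perStation.entrySet()) {
            StringBuilder line = new StringBuilder(entry.getKey()).append(":");
            String separator = " ";
            for (int i = 0; i < ORDER.length(); i++) {
                if (entry.getValue()[i] > 0) { // skip phenomena that never occurred
                    line.append(separator).append(ORDER.charAt(i))
                        .append(" = ").append(entry.getValue()[i]);
                    separator = ", ";
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"722000", "S", "1"}, {"722110", "F", "1"}, {"722110", "R", "1"},
            {"722110", "S", "2"}, {"722115", "F", "1"}, {"722115", "O", "2"},
            {"722115", "R", "2"}, {"722115", "S", "1"},
        };
        summarize(rows).forEach(System.out::println);
    }
}
```

Stations come out in sorted order here because of the `TreeMap`, so the lines match the question's expected content but not necessarily its ordering.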

A similar question about difficulties with a composite key in Hadoop can be found on Stack Overflow: https://stackoverflow.com/questions/18381684/
