
hadoop - Not understanding the path in the distributed cache


There are two things I don't understand in the code below:

  1. DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration())

I don't understand whether the URI path has to exist in HDFS. Please correct me if I'm wrong.

  2. What is p.getName().equals() in the code below:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MyDC {

        public static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

            // Lookup table built from the cached file: abbreviation -> state
            private Map<String, String> abMap = new HashMap<String, String>();
            private Text outputKey = new Text();
            private Text outputValue = new Text();

            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
                // Local paths of all files cached on this task node
                Path[] files = DistributedCache.getLocalCacheFiles(context.getConfiguration());
                for (Path p : files) {
                    if (p.getName().equals("abc.dat")) {
                        BufferedReader reader = new BufferedReader(new FileReader(p.toString()));
                        String line = reader.readLine();
                        while (line != null) {
                            String[] tokens = line.split("\t");
                            String ab = tokens[0];
                            String state = tokens[1];
                            abMap.put(ab, state);
                            line = reader.readLine();
                        }
                        reader.close();
                    }
                }
                if (abMap.isEmpty()) {
                    throw new IOException("Unable to load abbreviation data.");
                }
            }

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String row = value.toString();
                String[] tokens = row.split("\t");
                String inab = tokens[0];
                // Look up the full state name for the abbreviation in the input row
                String state = abMap.get(inab);
                outputKey.set(state);
                outputValue.set(row);
                context.write(outputKey, outputValue);
            }
        }

        public static void main(String[] args)
                throws IOException, ClassNotFoundException, InterruptedException {
            Job job = new Job();
            job.setJarByClass(MyDC.class);
            job.setJobName("DCTest");
            job.setNumReduceTasks(0);
            try {
                // The URI refers to a path in HDFS
                DistributedCache.addCacheFile(new URI("/abc.dat"), job.getConfiguration());
            } catch (Exception e) {
                System.out.println(e);
            }
            job.setMapperClass(MyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }

Best Answer

The idea of the distributed cache is to make some static data available to the task nodes before they start executing.

The file has to exist in HDFS so that the framework can add it to the distributed cache (that is, distribute it to every task node).
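If the file only exists on your local disk, you first need to copy it into HDFS, either with hadoop fs -put abc.dat /abc.dat on the command line or programmatically. Here is a minimal sketch using the FileSystem API (the paths abc.dat and /abc.dat are taken from the question; the class name UploadCacheFile is just for illustration):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UploadCacheFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Connects to the default file system from core-site.xml (HDFS on a cluster)
            FileSystem fs = FileSystem.get(conf);
            // Copy the local lookup file into HDFS so that addCacheFile can find it
            fs.copyFromLocalFile(new Path("abc.dat"), new Path("/abc.dat"));
        }
    }

Any code that runs before job submission can do this copy; the upload only has to happen once.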

DistributedCache.getLocalCacheFiles basically fetches all the cache files present on that task node. With if (p.getName().equals("abc.dat")) { you pick out the particular cache file that your application should process.
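As a side note, the DistributedCache class is deprecated in newer releases. A minimal sketch of the same lookup with the Hadoop 2.x API (my assumption that you can upgrade; the class name NewApiMapper is just for illustration, and abc.dat is the file from the question): register the file in the driver with job.addCacheFile(new URI("/abc.dat")), then read it in the mapper's setup():

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class NewApiMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // getCacheFiles() returns the original HDFS URIs that were registered
            // with job.addCacheFile(); each file is localized on the task node and,
            // on YARN, symlinked into the working directory under its file name.
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles == null) {
                return;
            }
            for (URI uri : cacheFiles) {
                Path p = new Path(uri.getPath());
                if (p.getName().equals("abc.dat")) {
                    // Open the localized copy through the symlink in the working directory
                    BufferedReader reader = new BufferedReader(new FileReader(p.getName()));
                    // ... parse the file exactly as in the question's setup() ...
                    reader.close();
                }
            }
        }
    }

The filtering by p.getName() works the same way in both APIs: the cache can hold many files, so you match on the file name to find the one you need.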

Please refer to the following documentation:

https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#DistributedCache

https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/filecache/DistributedCache.html#getLocalCacheFiles(org.apache.hadoop.conf.Configuration)

Regarding hadoop - not understanding the path in the distributed cache, this is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/37246043/
