hadoop - Using Hadoop DistributedCache with archives


Reposted by: 行者123. Last updated: 2023-12-02 20:07:09


The Hadoop DistributedCache documentation does not seem to describe adequately how to use the distributed cache. This is the example it gives:

 // Setting up the cache for the application

 1. Copy the requisite files to the FileSystem:

 $ bin/hadoop fs -copyFromLocal lookup.dat /myapp/lookup.dat  
 $ bin/hadoop fs -copyFromLocal map.zip /myapp/map.zip  
 $ bin/hadoop fs -copyFromLocal mylib.jar /myapp/mylib.jar
 $ bin/hadoop fs -copyFromLocal mytar.tar /myapp/mytar.tar
 $ bin/hadoop fs -copyFromLocal mytgz.tgz /myapp/mytgz.tgz
 $ bin/hadoop fs -copyFromLocal mytargz.tar.gz /myapp/mytargz.tar.gz

 2. Setup the application's JobConf:

 JobConf job = new JobConf();
 DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), 
                               job);
 DistributedCache.addCacheArchive(new URI("/myapp/map.zip"), job);
 DistributedCache.addFileToClassPath(new Path("/myapp/mylib.jar"), job);
 DistributedCache.addCacheArchive(new URI("/myapp/mytar.tar"), job);
 DistributedCache.addCacheArchive(new URI("/myapp/mytgz.tgz"), job);
 DistributedCache.addCacheArchive(new URI("/myapp/mytargz.tar.gz"), job);

 3. Use the cached files in the Mapper
 or Reducer:

 public static class MapClass extends MapReduceBase  
 implements Mapper<K, V, K, V> {

   private Path[] localArchives;
   private Path[] localFiles;

   public void configure(JobConf job) {
     // Get the cached archives/files
     File f = new File("./map.zip/some/file/in/zip.txt");
   }

   public void map(K key, V value, 
                   OutputCollector<K, V> output, Reporter reporter) 
   throws IOException {
     // Use data from the cached archives/files here
     // ...
     // ...
     output.collect(k, v);
   }
 }

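I have been searching for over an hour trying to figure out how to use it. After piecing together a few other SO questions, this is what I came up with: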

public static void main(String[] args) throws Exception {
    Job job = new Job(new JobConf(), "Job Name");
    JobConf conf = job.getConfiguration();
    DistributedCache.createSymlink(conf);
    DistributedCache.addCacheArchive(new URI("/ProjectDir/LookupTable.zip", job);
    // *Rest of configuration code*
}

public static class MyMapper extends Mapper<Object, Text, Text, IntWritable> 
{
    private Path[] localArchives;

    public void configure(JobConf job)
    {
        // Get the cached archive
        File file1 = new File("./LookupTable.zip/file1.dat");   
        BufferedReader br1index = new BufferedReader(new FileInputStream(file1));
    }

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException 
    { // *Map code* }
}

Where should I call the void configure(JobConf job) function?

Where do I use the private Path[] localArchives object?

Is my code in configure() the correct way to access files within the archive and to link a file with a BufferedReader?

Best answer

I will answer your questions in terms of the new API and common practice for using the distributed cache.

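Where should I call the void configure(JobConf job) function?

You do not call it yourself. In the new API its counterpart is protected void setup(Context context), which the framework calls once at the start of each map task. Logic that uses cached files usually goes there: for example, read the file and store its data in a field that map(), which is called after setup(), can then use.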
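The load-once pattern described above can be sketched without any Hadoop dependency. In the stand-alone example below, the `LookupTable` class, its `load` and `get` methods, and the tab-separated file format are all assumptions made for illustration. In a real job, `load` would be called from `setup()` and `get` from `map()`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: holds data read once per task and queried once per record. */
final class LookupTable {
    private final Map<String, String> entries = new HashMap<>();

    /** Reads "key<TAB>value" lines; this file format is an assumption. */
    static LookupTable load(Path file) throws IOException {
        LookupTable table = new LookupTable();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a key/value separator
                }
                table.entries.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }

    /** Returns the value stored for key, or null if the key is absent. */
    String get(String key) {
        return entries.get(key);
    }
}
```

Note that `Path` here is the JDK's `java.nio.file.Path`, not Hadoop's `org.apache.hadoop.fs.Path`. Keeping the parsed data in a field means the file is read once per task rather than once per input record.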

Where do I use the private Path[] localArchives object?

It is typically used in the setup() method to hold the local paths of the cached files. Something like this:

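  Path[] localArchive = DistributedCache.getLocalCacheArchives(context.getConfiguration());

Is my code in configure() the correct way to access files within the archive and to link a file with a BufferedReader?

It is missing the call that retrieves the path where the cached archive is stored (as shown above). Once you have the path, you can read a file as follows.

// localArchive[0] is the local directory the archive was unpacked into
FileSystem fs = FileSystem.getLocal(context.getConfiguration());
FSDataInputStream in = fs.open(new Path(localArchive[0], "file1.dat"));
BufferedReader br  = new BufferedReader(new InputStreamReader(in));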
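Because a cached archive is unpacked into a directory on the task's local disk, the files inside it can also be read with plain `java.io` calls. The sketch below assumes the unpacked directory is already known. `ArchiveReader` and `readFirstLine` are hypothetical names, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for reading a file from an unpacked archive directory. */
final class ArchiveReader {
    /** Opens entryName inside archiveDir and returns its first line (null if the file is empty). */
    static String readFirstLine(File archiveDir, String entryName) throws IOException {
        File entry = new File(archiveDir, entryName);
        // BufferedReader wraps a Reader, so the byte stream needs an
        // InputStreamReader in between; passing a FileInputStream directly does not compile.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(entry), StandardCharsets.UTF_8))) {
            return reader.readLine();
        }
    }
}
```

With the names from the question, a call might look like `ArchiveReader.readFirstLine(new File("./LookupTable.zip"), "file1.dat")`, provided the symlink to the unpacked archive exists in the task's working directory.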

Regarding "hadoop - Using Hadoop DistributedCache with archives", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21638863/
