java - Running dependent Hadoop jobs in one driver

Reposted by: 可可西里. Updated: 2023-11-01 15:41:31

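I currently have two Hadoop jobs, where the second job needs the output of the first job added to the distributed cache. At the moment I run them manually: after the first job finishes, I pass its output file as an argument to the second job, whose driver adds it to the cache.

The first job is just a simple map-only job, and I would like to run a single command that executes both jobs in sequence.

Can anyone help me put the output of the first job into the distributed cache so that it can be passed on to the second job?

Thanks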

Edit: here is the current driver for job 1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class PlaceDriver {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: PlaceMapper <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Place Mapper");
        job.setJarByClass(PlaceDriver.class);
        job.setMapperClass(PlaceMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[0]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Here is the driver for job 2. The output of job 1 is passed to job 2 as its first argument and loaded into the cache.

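import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class LocalityDriver {

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("Usage: LocalityDriver <cache> <in> <out>");
            System.exit(2);
        }
        Job job = new Job(conf, "Job Name Here");
        // The first argument is job 1's output; register it in the distributed cache.
        DistributedCache.addCacheFile(new Path(otherArgs[0]).toUri(), job.getConfiguration());
        job.setNumReduceTasks(1); //TODO: Will change
        job.setJarByClass(LocalityDriver.class);
        job.setMapperClass(LocalityMapper.class);
        job.setCombinerClass(TopReducer.class);
        job.setReducerClass(TopReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        TextInputFormat.addInputPath(job, new Path(otherArgs[1]));
        TextOutputFormat.setOutputPath(job, new Path(otherArgs[2]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}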

Best answer

Create both Job objects in the same main. Wait for the first one to complete, then run the other.

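import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class DefaultTest extends Configured implements Tool {

    public int run(String[] args) throws Exception {

        // First job: reads args[0] and writes its output to args[1].
        // getConf() returns the configuration that ToolRunner populated.
        Job job = new Job(getConf());

        job.setJobName("DefaultTest-blockx15");

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Map and Reduce are your own mapper and reducer classes.
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setNumReduceTasks(15);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setJarByClass(DefaultTest.class);

        // Block until the first job finishes; do not start the second one if it failed.
        if (!job.waitForCompletion(true)) {
            return 1;
        }

        Job job2 = new Job(getConf());

        // define your second job with the input path defined as the output of the previous job.

        return 0;
    }

    public static void main(String[] args) throws Exception {
        // ToolRunner parses the generic Hadoop options itself and passes the rest to run().
        System.exit(ToolRunner.run(new Configuration(), new DefaultTest(), args));
    }
}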
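
One detail to watch when both jobs run from a single driver: the path given to setOutputPath for job 1 is a directory, not a single file, so the driver has to pick the data files inside it before adding them to the cache for job 2. Below is a minimal plain-Java sketch of that selection step only. It assumes the usual "part-" prefix on output file names and that the directory listing was obtained elsewhere (for example via FileSystem.listStatus). The class name CacheFileSelector, the method selectPartFiles and the sample paths are hypothetical and not part of the original question or answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CacheFileSelector {

    /**
     * Returns the full paths of the data files in a job output directory.
     * Assumption: data files start with "part-" (e.g. part-m-00000), while
     * entries such as _SUCCESS or _logs hold no records and are skipped.
     */
    public static List<String> selectPartFiles(String outputDir, List<String> fileNames) {
        List<String> selected = new ArrayList<>();
        for (String name : fileNames) {
            if (name.startsWith("part-")) {
                selected.add(outputDir + "/" + name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Hypothetical listing of job 1's output directory.
        List<String> listing =
                Arrays.asList("_SUCCESS", "part-m-00000", "part-m-00001", "_logs");
        System.out.println(selectPartFiles("/user/me/places", listing));
    }
}
```

Each selected path could then be passed to DistributedCache.addCacheFile(new Path(p).toUri(), job2.getConfiguration()), the same call the job 2 driver already uses, after job 1 has completed and before job 2 is submitted.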

For more on java - Running dependent Hadoop jobs in one driver, see the similar question on Stack Overflow: https://stackoverflow.com/questions/10309939/
