Hadoop MultipleOutputs does not write to multiple files when the file format is a custom format


Reposted by: 可可西里 | Updated: 2023-11-01 14:47:40


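I am trying to read from Cassandra and write the reducer output to multiple output files using the MultipleOutputs API (Hadoop version 1.0.3). In my case the file formats are custom output formats that extend FileOutputFormat. I configured my job in a manner similar to what is shown in the MultipleOutputs API documentation. However, when I run the job I get only one output file, named part-r-00000, and it is in text output format. If job.setOutputFormatClass() is not set, TextOutputFormat is used as the default format. Also, it only allows one of the two format classes to be initialized. It completely ignores the output formats I specified in MultipleOutputs.addNamedOutput(job, "format1", MyCustomFileFormat1.class, Text.class, Text.class) and MultipleOutputs.addNamedOutput(job, "format2", MyCustomFileFormat2.class, Text.class, Text.class). Is anyone else facing a similar problem, or am I doing something wrong?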

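I also tried writing a very simple MR program that reads from a text file and writes output in two formats, TextOutputFormat and SequenceFileOutputFormat, as shown in the MultipleOutputs API documentation. No luck there either: I get only one output file, in text output format.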

Can someone help me with this?

Job job = new Job(getConf(), "cfdefGen");
job.setJarByClass(CfdefGeneration.class);

//read input from cassandra column family
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
job.setInputFormatClass(ColumnFamilyInputFormat.class);
job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");

//thrift input job configurations
ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
ConfigHelper.setInputInitialAddress(job.getConfiguration(), HOST);
ConfigHelper.setInputPartitioner(job.getConfiguration(), "RandomPartitioner");

SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(ByteBufferUtil.bytes("classification")));
//ConfigHelper.setRangeBatchSize(job.getConfiguration(), 2048);
ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

//specification for mapper
job.setMapperClass(MyMapper.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

//specifications for reducer (writing to files)
job.setReducerClass(ReducerToFileSystem.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
//job.setOutputFormatClass(MyCdbWriter1.class);
job.setNumReduceTasks(1);

//set output path for storing output files
Path filePath = new Path(OUTPUT_DIR);
FileSystem hdfs = FileSystem.get(getConf());
if(hdfs.exists(filePath)){
    hdfs.delete(filePath, true);
}
MyCdbWriter1.setOutputPath(job, new Path(OUTPUT_DIR));

MultipleOutputs.addNamedOutput(job, "cdb1", MyCdbWriter1.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "cdb2", MyCdbWriter2.class, Text.class, Text.class);

boolean success = job.waitForCompletion(true);
return success ? 0:1;

public static class ReducerToFileSystem extends Reducer<Text, Text, Text, Text>
{
    private MultipleOutputs<Text, Text> mos;

    public void setup(Context context){
        mos = new MultipleOutputs<Text, Text>(context);
    }

    //public void reduce(Text key, Text value, Context context) 
    //throws IOException, InterruptedException (This was the mistake, changed the signature and it worked fine)
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
    throws IOException, InterruptedException
    {
        //context.write(key, value);
        // write every value for this key to both named outputs
        for (Text value : values) {
            mos.write("cdb1", key, value, OUTPUT_DIR+"/"+"cdb1");
            mos.write("cdb2", key, value, OUTPUT_DIR+"/"+"cdb2");
        }
        context.progress();
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

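public class MyCdbWriter1<K, V> extends FileOutputFormat<K, V> 
{
    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException 
    {
        // Body not shown in the original post; it should build and return a CdbDataRecord.
        throw new UnsupportedOperationException("getRecordWriter body not shown in the original post");
    }

    public static void setOutputPath(Job job, Path outputDir) {
        job.getConfiguration().set("mapred.output.dir", outputDir.toString());
    }

    protected static class CdbDataRecord<K, V> extends RecordWriter<K, V>
    {
        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            // body not shown in the original post
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            // body not shown in the original post
        }
    }
}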

Best answer

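I found my mistake after debugging: my reduce method was never being called. My function definition did not match the one in the API, so I changed it from public void reduce(Text key, Text value, Context context) to public void reduce(Text key, Iterable<Text> values, Context context). Because the original signature took a single Text instead of an Iterable<Text>, it was an overload rather than an override, so the framework ran the default Reducer.reduce, which passes each record through to context.write and therefore to the default TextOutputFormat file. I don't know why the reduce method did not have the @Override annotation; it would have prevented my mistake.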
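The overload-versus-override behaviour described in this answer can be reproduced without a Hadoop cluster. The following is a minimal sketch in plain Java: BaseReducer is a hypothetical stand-in for Hadoop's Reducer (it is not the real class), and the class and method names are made up for illustration. It assumes only what the answer states, namely that the framework always calls the Iterable-based reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverrideDemo {

    // Hypothetical stand-in for Hadoop's Reducer: run() always calls the
    // Iterable-based reduce(), whose default implementation passes values through.
    static class BaseReducer {
        final List<String> output = new ArrayList<>();

        protected void reduce(String key, Iterable<String> values) {
            for (String value : values) {
                output.add("default:" + key + "=" + value);
            }
        }

        final void run(String key, List<String> values) {
            reduce(key, values);
        }
    }

    // The mistake: the second parameter is a single value, so this method is an
    // overload that run() never calls. Adding @Override here would not compile.
    static class WrongSignatureReducer extends BaseReducer {
        public void reduce(String key, String value) {
            output.add("custom:" + key + "=" + value);
        }
    }

    // The fix: same parameter types as the base method, checked by @Override.
    static class CorrectReducer extends BaseReducer {
        @Override
        protected void reduce(String key, Iterable<String> values) {
            for (String value : values) {
                output.add("custom:" + key + "=" + value);
            }
        }
    }

    // Feeds one key with two values through the reducer and returns what it emitted.
    static List<String> runWith(BaseReducer reducer) {
        reducer.run("k", Arrays.asList("a", "b"));
        return reducer.output;
    }

    public static void main(String[] args) {
        System.out.println(runWith(new WrongSignatureReducer())); // [default:k=a, default:k=b]
        System.out.println(runWith(new CorrectReducer()));        // [custom:k=a, custom:k=b]
    }
}
```

Annotating every intended override with @Override turns this kind of silent mismatch into a compile-time error, which is why it is worth adding to setup, reduce and cleanup alike.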

Regarding "Hadoop MultipleOutputs does not write to multiple files when the file format is a custom format", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12981233/
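A quick way to tell whether the named outputs are being used is to look at the file names in the output directory. The sketch below is a hypothetical helper, not part of Hadoop, that mirrors the naming convention FileOutputFormat-based formats commonly follow: base name, then m or r for the task type, then a five-digit partition number, then an optional extension. It assumes the custom output format derives its file name through FileOutputFormat's default work-file naming; a format that builds its own path may differ.

```java
import java.util.Locale;

public class NamedOutputFileNames {

    // Hypothetical helper (not a Hadoop API): builds the file name that a
    // FileOutputFormat-style writer is expected to produce for one task.
    static String expectedFileName(String baseName, boolean mapTask, int partition, String extension) {
        return String.format(Locale.ROOT, "%s-%s-%05d%s",
                baseName, mapTask ? "m" : "r", partition, extension);
    }

    public static void main(String[] args) {
        // Default output of reducer 0, as seen in the question.
        System.out.println(expectedFileName("part", false, 0, "")); // part-r-00000
        // Named outputs from the question, once reduce() is really invoked.
        System.out.println(expectedFileName("cdb1", false, 0, "")); // cdb1-r-00000
        System.out.println(expectedFileName("cdb2", false, 0, "")); // cdb2-r-00000
    }
}
```

If only a part-r-* file shows up, as in the question, the records went through context.write rather than through mos.write.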
