hadoop - HBase, Map/Reduce and SequenceFiles: mapred.output.format.class is incompatible with new map API mode

I am trying to produce Mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input. I have gotten the impression that I cannot write to a sequence file from a map-reduce job that uses HBase as its source. Here goes nothing:

public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);

    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);

    jobConf.setOutputFormat(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }

    FileOutputFormat.setOutputPath(jobConf, outputDir);

    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);

    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);

    job.waitForCompletion(true);
}

I have some test code that runs it, but I get:
java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

So I think my problem is that I am using the import org.apache.hadoop.mapreduce.Job, and the setOutputFormat method expects an instance of the class org.apache.hadoop.mapreduce.OutputFormat. That class has only four implementations, none of which is for sequence files. Here are its javadocs:

http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/OutputFormat.html

I would use the old-API version of the Job class if I could, but HBase's TableMapReduceUtil only accepts the new-API Job.

I suppose I could write my results as text first and then run a second map/reduce job to convert the output to sequence files, but that sounds very inefficient.

There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it is deprecated in my version.

My mahout jars are version 0.7-cdh4.5.0.
My HBase jars are version 0.94.6-cdh4.5.0.
All of my hadoop jars are 2.0.0-cdh4.5.0.

Can someone tell me how to write a SequenceFile from M/R in my situation?

Best Answer

Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat; you just have to look further than the direct subclasses in the Javadoc to find it:

http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/SequenceFileOutputFormat.html
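To make the hierarchy concrete, here is a tiny check of my own (an illustration assuming Hadoop 2.x on the classpath, not code from your project):

import org.apache.hadoop.mapreduce.OutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class HierarchyCheck {
    public static void main(String[] args) {
        // Prints "true": the new-API SequenceFileOutputFormat extends
        // FileOutputFormat, which extends the abstract mapreduce.OutputFormat.
        // The identically named class in org.apache.hadoop.mapred does not,
        // so importing that one by mistake forces you back onto the old API.
        System.out.println(
                OutputFormat.class.isAssignableFrom(SequenceFileOutputFormat.class));
    }
}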

You have probably imported the wrong (old) one in your driver class. Since you did not include your imports in the code example, it is impossible to tell that from your question.
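For concreteness, here is a minimal sketch of how the driver might look once everything sits on the new API. This is my reconstruction, not tested against your exact CDH 4.5.0 setup: it keeps your names (HouseVectorizingMapper, HouseAvroUtil, configuration, minimumHouse, maximumHouse, assumed to be fields of the enclosing class). The essential changes are importing SequenceFileOutputFormat from org.apache.hadoop.mapreduce.lib.output and calling setOutputFormatClass on the Job, instead of setOutputFormat on a JobConf, which is what sets the offending mapred.output.format.class property:

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// new-API output format, from mapreduce.lib.output rather than mapred
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.mahout.math.VectorWritable;

public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    Job job = Job.getInstance(configuration);

    // pass the min/max to the mappers for normalization, as in the question
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse, maximumHouse));
    job.getConfiguration().set("minmax", minmax);

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);

    // map-only job: vectors go straight to HDFS, order does not matter
    job.setNumReduceTasks(0);

    // with zero reducers the map output goes directly through the output
    // format, which reads the job's output key/value classes
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(VectorWritable.class);

    // the new-API setter; the old JobConf.setOutputFormat call is what set
    // mapred.output.format.class and triggered the IOException
    job.setOutputFormatClass(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }
    FileOutputFormat.setOutputPath(job, outputDir);

    job.waitForCompletion(true);
}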

Regarding hadoop - HBase, Map/Reduce and SequenceFiles: mapred.output.format.class is incompatible with new map API mode, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22138664/
