
hadoop - Hadoop streaming job fails with a missing options error when using the rmr package in R


I am trying to write a data frame from R to HDFS using the rmr package in RStudio on Amazon EMR.
The tutorial I am following is
http://blogs.aws.amazon.com/bigdata/post/Tx37RSKRFDQNTSL/Statistical-Analysis-with-Open-Source-R-and-RStudio-on-Amazon-EMR

The code I wrote is:

Sys.setenv(HADOOP_CMD="/home/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/hadoop/contrib/streaming/hadoop-streaming.jar")
Sys.setenv(JAVA_HOME="/usr/java/latest/jre")

# load libraries
library(rmr2)
library(rhdfs)
library(plyrmr)

# initiate rhdfs package
hdfs.init()

# a very simple plyrmr example to test the package
# running code locally
bind.cols(mtcars, carb.per.cyl = carb/cyl)
# same code on Hadoop cluster
to.dfs(mtcars, output="/tmp/mtcars")

I am following this code tutorial:
https://github.com/awslabs/emr-bootstrap-actions/blob/master/R/Hadoop/examples/biganalyses_example.R

The Hadoop version is Cloudera CDH5. I have also set the environment variables appropriately.

When I run the above code, I get the following error:
    > to.dfs(data,output="/tmp/cust_seg")
15/03/09 20:00:21 ERROR streaming.StreamJob: Missing required options: input, output
Usage: $HADOOP_HOME/bin/hadoop jar \
$HADOOP_HOME/hadoop-streaming.jar [options]
Options:
-input <path> DFS input file(s) for the Map step
-output <path> DFS output directory for the Reduce step
-mapper <cmd|JavaClassName> The streaming command to run
-combiner <JavaClassName> Combiner has to be a Java class
-reducer <cmd|JavaClassName> The streaming command to run
-file <file> File/dir to be shipped in the Job jar file
-inputformat TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName Optional.
-outputformat TextOutputFormat(default)|JavaClassName Optional.
-partitioner JavaClassName Optional.
-numReduceTasks <num> Optional.
-inputreader <spec> Optional.
-cmdenv <n>=<v> Optional. Pass env.var to streaming commands
-mapdebug <path> Optional. To run this script when a map task fails
-reducedebug <path> Optional. To run this script when a reduce task fails
-verbose

Generic options supported are
-conf <configuration file> specify an application configuration file
-D <property=value> use value for given property
-fs <local|namenode:port> specify a namenode
-jt <local|resourcemanager:port> specify a ResourceManager
-files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]


For more details about these options:
Use $HADOOP_HOME/bin/hadoop jar build/hadoop-streaming.jar -info

Streaming Job Failed!

I can't figure out a solution to this problem. Any help would be much appreciated.

Best Answer

The error occurs because the HADOOP_STREAMING environment variable is not set correctly in your code. You should specify the full path including the jar file name. The R code below works fine for me.

R code (I am using Hadoop 2.4.0):

Sys.setenv("HADOOP_CMD"="/usr/local/hadoop/bin/hadoop")
Sys.setenv("HADOOP_STREAMING"="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.4.0.jar")

# load libraries
library(rmr2)
library(rhdfs)

# initiate rhdfs package
hdfs.init()

# a very simple plyrmr example to test the package
library(plyrmr)

# running code locally
bind.cols(mtcars, carb.per.cyl = carb/cyl)

# same code on Hadoop cluster
to.dfs(mtcars, output="/tmp/mtcars")

# list the files of tmp folder
hdfs.ls("/tmp")

permission owner group size modtime file
1 -rw-r--r-- manohar supergroup 1685 2015-03-22 16:12 /tmp/mtcars
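Since the root cause here is an environment variable pointing at a non-existent jar, a quick sanity check before calling hdfs.init() can surface the misconfiguration with a clearer message than the streaming job's "Missing required options" error. The sketch below is only illustrative; the paths in the error messages are the Hadoop 2.4.0 locations used above and will differ per installation:

```r
# Verify the Hadoop environment variables before initializing rhdfs.
# Stops with a descriptive error if a variable is unset or points
# to a file that does not exist on this machine.
check_hadoop_env <- function() {
  for (var in c("HADOOP_CMD", "HADOOP_STREAMING")) {
    path <- Sys.getenv(var)
    if (!nzchar(path)) {
      stop(var, " is not set; set it with Sys.setenv() before hdfs.init()")
    }
    if (!file.exists(path)) {
      stop(var, " points to a missing file: ", path)
    }
  }
  invisible(TRUE)
}

check_hadoop_env()
```

In particular, HADOOP_STREAMING must name the jar file itself (e.g. .../hadoop-streaming-2.4.0.jar), not the directory containing it; rmr2 passes this path to the `hadoop jar` command, which is why a wrong value surfaces as a streaming-job failure rather than an R error.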

Hope this helps.

Regarding "hadoop - Hadoop streaming job fails with a missing options error when using the rmr package in R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28975189/
