
hadoop - Using HDFS with Apache Spark on Amazon EC2


I set up a Spark cluster using the spark-ec2 script. The cluster is up, and I am now trying to put a file onto HDFS so that my cluster can work with it.

On my master node I have a file, data.txt, which I added to HDFS with:

ephemeral-hdfs/bin/hadoop fs -put data.txt /data.txt
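To confirm the upload, here is a minimal Java sketch; it is hypothetical in that it assumes the namenode listens on the master at port 9000 (the ephemeral-HDFS default noted in the answer below) and uses ec2-xxx.amazonaws.com as a placeholder hostname:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCheck {
    public static void main(String[] args) throws Exception {
        // Host and port are assumptions; substitute the active master's address.
        FileSystem fs = FileSystem.get(
                new URI("hdfs://ec2-xxx.amazonaws.com:9000"), new Configuration());
        // Print whether /data.txt is visible in HDFS
        System.out.println("exists: " + fs.exists(new Path("/data.txt")));
    }
}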

Now, in my code, I have:

JavaRDD<String> rdd = sc.textFile("hdfs://data.txt",8);

Executing this throws the following exception:

Exception in thread "main" java.net.UnknownHostException: unknown host: data.txt
at org.apache.hadoop.ipc.Client$Connection.<init>(Client.java:214)
at org.apache.hadoop.ipc.Client.getConnection(Client.java:1196)
at org.apache.hadoop.ipc.Client.call(Client.java:1050)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
at com.sun.proxy.$Proxy6.getProtocolVersion(Unknown Source)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:396)
at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:379)
at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:119)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:238)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:203)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1386)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1404)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:176)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:208)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:203)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:217)
at org.apache.spark.RangePartitioner.<init>(Partitioner.scala:123)
at org.apache.spark.rdd.OrderedRDDFunctions.sortByKey(OrderedRDDFunctions.scala:62)
at org.apache.spark.rdd.RDD.sortBy(RDD.scala:488)
at org.apache.spark.api.java.JavaRDD.sortBy(JavaRDD.scala:188)
at SimpleApp.sortBy(SimpleApp.java:118)
at SimpleApp.main(SimpleApp.java:30)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

How do I put this file onto HDFS correctly, so that I can start processing the dataset with my cluster? I also tried simply using a local file path, for example:

JavaRDD<String> rdd = sc.textFile("/home/ec2-user/data.txt",8);

When I do this and submit the job with:

./spark/bin/spark-submit --class SimpleApp --master spark://ec2-xxx.amazonaws.com:7077 --total-executor-cores 8 /home/ec2-user/simple-project-1.0.jar

only one executor runs, and the worker nodes in the cluster do not seem to participate. I believe this is because I am using a local file and EC2 does not have NFS.
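A side note on the local-path attempt: Spark can read a local path only if the file is present at the same location on every node, and an explicit file:// scheme avoids the path being resolved against the configured default filesystem. A minimal sketch, assuming data.txt has already been copied to /home/ec2-user/ on each worker:

// Hypothetical: /home/ec2-user/data.txt must exist on the master AND on
// every worker node, or executors on other machines will fail to read it.
JavaRDD<String> rdd = sc.textFile("file:///home/ec2-user/data.txt", 8);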

Best Answer

The first component after the // in an hdfs:// URL is a hostname, so hdfs://data.txt is interpreted as an attempt to contact a host named data.txt (hence the UnknownHostException). The URL needs to be hdfs://{active_master}:9000/data.txt (and, in case it is useful in the future, the spark-ec2 script's default port for persistent HDFS is 9010).
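Concretely, a minimal sketch of the corrected read, where ec2-xxx.amazonaws.com is the same placeholder used in the spark-submit command above; substitute the active master's hostname:

// Read from the ephemeral HDFS namenode on the active master.
JavaRDD<String> rdd = sc.textFile("hdfs://ec2-xxx.amazonaws.com:9000/data.txt", 8);

With the data on HDFS, every worker can read its own input splits, so the job is no longer bottlenecked on a single executor reading a local file.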

Regarding hadoop - Using HDFS with Apache Spark on Amazon EC2, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30702212/
