
scala - Loading Spark data locally: Incomplete HDFS URI


I'm having trouble loading a local CSV file through SBT. Basically, I wrote a Spark program in Scala Eclipse that reads the following file:

val searches = sc.textFile("hdfs:///data/searches")

This works fine against HDFS, but for debugging I would like to load the file from a local directory that I have set up inside the project directory.

So I tried the following:
val searches = sc.textFile("file:///data/searches")
val searches = sc.textFile("./data/searches")
val searches = sc.textFile("/data/searches")

None of these lets me read the file locally, and they all return this error under SBT:
Exception in thread "main" java.io.IOException: Incomplete HDFS URI, no host: hdfs:/data/pages
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:143)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2397)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.FlatMappedRDD.getPartitions(FlatMappedRDD.scala:30)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1135)
at org.apache.spark.rdd.RDD.count(RDD.scala:904)
at com.user.Result$.get(SparkData.scala:200)
at com.user.StreamingApp$.main(SprayHerokuExample.scala:35)
at com.user.StreamingApp.main(SprayHerokuExample.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

In the stack trace, com.user.Result$.get(SparkData.scala:200) is the line that calls sc.textFile. It appears to default to the Hadoop (HDFS) filesystem. What can I do to read this file locally?

Edit: for running locally, I reconfigured build.sbt:
submit <<= inputTask{ (argTask: TaskKey[Seq[String]]) => {
  (argTask, mainClass in Compile, assemblyOutputPath in assembly, sparkHome) map {
    (args, main, jar, sparkHome) => {
      args match {
        case List(output) => {
          val sparkCmd = sparkHome + "/bin/spark-submit"
          Process(
            sparkCmd :: "--class" :: main.get :: "--master" :: "local[4]" ::
            jar.getPath :: "local[4]" :: output :: Nil) !
        }
        case _ => Process("echo" :: "Usage" :: Nil) !
      }
    }
  }
}}

The submit command is what I use to run the code.
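For example, a hypothetical invocation of that task from the command line (the output path is purely illustrative):

sbt "submit /tmp/searches-output"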

Solution found: it turns out that file:///path/ is the correct approach, but in my case the full path was needed, i.e. home/projects/data/searches. Just using data/searches did not work (even though it sits under the home/projects directory).
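For reference, a minimal sketch of the local-debug setup, assuming a local[4] master and the absolute project path combined into a file:// URI (the app name and exact path are illustrative, not from the original post):

import org.apache.spark.{SparkConf, SparkContext}

// Local master, so no Hadoop cluster is needed while debugging.
val conf = new SparkConf().setAppName("SearchesLocalDebug").setMaster("local[4]")
val sc = new SparkContext(conf)

// An absolute file:// URI works; a bare relative path such as "data/searches"
// is resolved against the configured default (HDFS) filesystem and fails.
val searches = sc.textFile("file:///home/projects/data/searches")
println(searches.count())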

Best Answer

Use:

val searches = sc.textFile("hdfs://host:port_no/data/searches")

By default:
host: master
port_no: 9000
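As a sketch with those defaults filled in (the host name and port are assumptions; substitute your own NameNode address):

// A fully qualified HDFS URI: the "no host" error goes away once a host is supplied.
val searches = sc.textFile("hdfs://master:9000/data/searches")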

Regarding scala - Loading Spark data locally: Incomplete HDFS URI, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29079396/
