
scala - Using addFile with pipe on a YARN cluster


I have been using pyspark successfully with my YARN cluster. The work I do involves using an RDD's pipe command to send data through a binary I wrote. I can do this easily in pyspark like so (assuming 'sc' is already defined):

sc.addFile("./dumb_prog")
t = sc.parallelize(range(10))
u = t.pipe("dumb_prog")
u.take(10)  # gives the expected result

However, if I do the same thing in Scala, the pipe command fails with a 'Cannot run program "dumb_prog": error=2, No such file or directory' error. Here is the code in the Scala shell:

sc.addFile("./dumb_prog")
val t = sc.parallelize(0 until 10)
val u = t.pipe("dumb_prog")
u.take(10)

Why does this work in Python but not in Scala? Is there a way to make it work in Scala?

Here is the full error message from the Scala side:
14/09/29 13:07:47 INFO SparkContext: Starting job: take at <console>:17
14/09/29 13:07:47 INFO DAGScheduler: Got job 3 (take at <console>:17) with 1
output partitions (allowLocal=true)
14/09/29 13:07:47 INFO DAGScheduler: Final stage: Stage 3(take at
<console>:17)
14/09/29 13:07:47 INFO DAGScheduler: Parents of final stage: List()
14/09/29 13:07:47 INFO DAGScheduler: Missing parents: List()
14/09/29 13:07:47 INFO DAGScheduler: Submitting Stage 3 (PipedRDD[3] at pipe
at <console>:14), which has no missing parents
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(2136) called with
curMem=7453, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3 stored as values in
memory (estimated size 2.1 KB, free 265.4 MB)
14/09/29 13:07:47 INFO MemoryStore: ensureFreeSpace(1389) called with
curMem=9589, maxMem=278302556
14/09/29 13:07:47 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes
in memory (estimated size 1389.0 B, free 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on 10.10.0.20:37574 (size: 1389.0 B, free: 265.4 MB)
14/09/29 13:07:47 INFO BlockManagerMaster: Updated info of block
broadcast_3_piece0
14/09/29 13:07:47 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3
(PipedRDD[3] at pipe at <console>:14)
14/09/29 13:07:47 INFO YarnClientClusterScheduler: Adding task set 3.0 with
1 tasks
14/09/29 13:07:47 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID
6, SERVERNAME, PROCESS_LOCAL, 1201 bytes)
14/09/29 13:07:47 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory
on SERVERNAME:57118 (size: 1389.0 B, free: 530.3 MB)
14/09/29 13:07:47 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6,
SERVERNAME): java.io.IOException: Cannot run program "dumb_prog": error=2,
No such file or directory
java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
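
The trace shows the exception coming from ProcessBuilder.start inside PipedRDD.compute: the executor JVM execs the command string directly, resolving a bare name against PATH and a "./name" prefix against the task's working directory. A minimal illustrative sketch of that failure mode (not the actual Spark source; it simply reproduces the same exception locally):

// Illustrative only: PipedRDD hands the command string to
// java.lang.ProcessBuilder (see the stack trace above). A bare
// name that is not on PATH and not in the working directory throws:
//   java.io.IOException: Cannot run program "dumb_prog": error=2
val pb = new java.lang.ProcessBuilder("dumb_prog")
pb.start()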

Best Answer

I ran into a similar issue with Spark 1.3.0 in YARN client mode. When I looked in the application cache directory, the file was never pushed to the executors, even when using --files. But when I added the following, it was pushed to every executor:

sc.addFile("dumb_prog", true)  // second argument: recursive = true
t.pipe("./dumb_prog")          // note the explicit "./" prefix

I think this is a bug, but the above worked around the problem for me.
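
To confirm the file actually reached the executors, a small diagnostic sketch along these lines may help (run in the same shell; it assumes each executor picks up at least one task, and uses SparkFiles.get, which resolves the executor-local path where addFile placed the file):

import org.apache.spark.SparkFiles

// Resolve the path inside the tasks so SparkFiles.get runs on the
// executors rather than the driver, then check the file is there.
sc.parallelize(0 until 100)
  .map { _ =>
    val path = SparkFiles.get("dumb_prog")
    (path, new java.io.File(path).exists)
  }
  .distinct()
  .collect()
  .foreach(println)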

Regarding scala - using addFile with pipe on a YARN cluster, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/28688886/
