azure - How to run Spark + Cassandra + Mesos (DCOS) with dynamic resource allocation?

Reposted by: 行者123. Updated: 2023-12-03 03:09:11

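Through Marathon we run the MesosExternalShuffleService on every slave node. When we submit a Spark job through the dcos CLI in coarse-grained mode without dynamic allocation, everything works as expected. When we submit the same job with dynamic allocation, it fails.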

16/12/08 19:20:42 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file:/tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index
at   org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
...
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index (No such file or directory)

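Full description:

We installed Mesos (DCOS) with Marathon using the Azure portal.
Through Universe packages we installed: Cassandra, Spark and Marathon-lb.
We generated test data in Cassandra.
On my laptop I installed the dcos CLI.

When I submit a job as follows, everything works as expected: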

./dcos spark run --submit-args="--properties-file coarse-grained.conf --class portal.spark.cassandra.app.ProductModelPerNrOfAlerts http://marathon-lb-default.marathon.mesos:10018/jars/spark-cassandra-assembly-1.0.jar"
Run job succeeded. Submission id: driver-20161208185927-0043

cqlsh:sp> select count(*) from product_model_per_alerts_by_date ;

count
-------
476

coarse-grained.conf:

spark.cassandra.connection.host 10.32.0.17
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.cores 1
spark.executor.memory 1g
spark.executor.instances 2
spark.submit.deployMode cluster
spark.cores.max 4

portal.spark.cassandra.app.ProductModelPerNrOfAlerts:

package portal.spark.cassandra.app

import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.{SparkConf, SparkContext}

object ProductModelPerNrOfAlerts {
  def main(args: Array[String]): Unit = {
    // loadDefaults = true: also read spark.* Java system properties.
    val conf = new SparkConf(true)
      .setAppName("cassandraSpark-ProductModelPerNrOfAlerts")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    import sqlContext.implicits._

    // Read the source table from Cassandra, keeping only the columns needed.
    val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "asset_history", "keyspace" -> "sp"))
      .load()
      .select("datestamp", "product_model", "nr_of_alerts")

    // Average nr_of_alerts per (datestamp, product_model); the groupBy requires a shuffle.
    val dr = df
      .groupBy("datestamp", "product_model")
      .avg("nr_of_alerts")
      .toDF("datestamp", "product_model", "nr_of_alerts")

    // Write the aggregated rows back to Cassandra.
    dr.write
      .mode(SaveMode.Overwrite)
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "product_model_per_alerts_by_date", "keyspace" -> "sp"))
      .save()

    sc.stop()
  }
}

Dynamic allocation

Through Marathon we run the Mesos external shuffle service:

{
  "id": "spark-mesos-external-shuffle-service-tt",
  "container": {
     "type": "DOCKER",
     "docker": {
        "image": "jpavt/mesos-spark-hadoop:mesos-external-shuffle-service-1.0.4-2.0.1",
        "network": "BRIDGE",
        "portMappings": [
           { "hostPort": 7337, "containerPort": 7337, "servicePort": 7337 }
         ],
       "forcePullImage":true,
       "volumes": [
         {
           "containerPath": "/tmp",
           "hostPath": "/tmp",
           "mode": "RW"
         }
       ]
     }
   },
   "instances": 9,
   "cpus": 0.2,
   "mem": 512,
   "constraints": [["hostname", "UNIQUE"]]
 }

Dockerfile for jpavt/mesos-spark-hadoop:mesos-external-shuffle-service-1.0.4-2.0.1:

FROM mesosphere/spark:1.0.4-2.0.1
WORKDIR /opt/spark/dist
ENTRYPOINT ["./bin/spark-class", "org.apache.spark.deploy.mesos.MesosExternalShuffleService"]

Now, when I submit the job with dynamic allocation, it fails:

./dcos spark run --submit-args="--properties-file dynamic-allocation.conf --class portal.spark.cassandra.app.ProductModelPerNrOfAlerts http://marathon-lb-default.marathon.mesos:10018/jars/spark-cassandra-assembly-1.0.jar"
 Run job succeeded. Submission id: driver-20161208191958-0047

select count(*) from product_model_per_alerts_by_date ;

count
-------
 5

dynamic-allocation.conf:

spark.cassandra.connection.host 10.32.0.17
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.cores 1
spark.executor.memory 1g
spark.submit.deployMode cluster
spark.cores.max 4

spark.shuffle.service.enabled true
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.maxExecutors 5
spark.dynamicAllocation.cachedExecutorIdleTimeout 120s
spark.dynamicAllocation.schedulerBacklogTimeout 10s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 20s
spark.mesos.executor.docker.volumes /tmp:/tmp:rw
spark.local.dir /tmp

Logs from Mesos:

16/12/08 19:20:42 INFO MemoryStore: Block broadcast_7_piece0 stored as bytes in memory (estimated size 18.0 KB, free 366.0 MB)
16/12/08 19:20:42 INFO TorrentBroadcast: Reading broadcast variable 7 took 21 ms
16/12/08 19:20:42 INFO MemoryStore: Block broadcast_7 stored as values in memory (estimated size 38.6 KB, free 366.0 MB)
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 0, fetching them
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@10.32.0.4:45422)
16/12/08 19:20:42 INFO MapOutputTrackerWorker: Got the output locations
16/12/08 19:20:42 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks out of 58 blocks
16/12/08 19:20:42 INFO TransportClientFactory: Successfully created connection to /10.32.0.11:7337 after 2 ms (0 ms spent in bootstraps)
16/12/08 19:20:42 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 13 ms
16/12/08 19:20:42 ERROR OneForOneBlockFetcher: Failed while starting block fetches java.lang.RuntimeException: java.lang.RuntimeException: Failed to open file: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index
at   org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
...
 Caused by: java.io.FileNotFoundException: /tmp/blockmgr-d4df5df4-24c9-41a3-9f26-4c1aba096814/30/shuffle_0_0_0.index (No such file or directory)

Logs from the Marathon app spark-mesos-external-shuffle-service-tt:

...
16/12/08 19:20:29 INFO MesosExternalShuffleBlockHandler: Received registration request from app 704aec43-1aa3-4971-bb98-e892beeb2c45-0008-driver-20161208191958-0047 (remote address /10.32.0.4:49710, heartbeat timeout 120000 ms).
16/12/08 19:20:31 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=704aec43-1aa3-4971-bb98-e892beeb2c45-0008-driver-20161208191958-0047, execId=2} with ExecutorShuffleInfo{localDirs=[/tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
16/12/08 19:20:38 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 8157825166903585542
java.lang.RuntimeException: Failed to open file: /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index
at org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getSortBasedShuffleBlockData(ExternalShuffleBlockResolver.java:234)
...
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
Caused by: java.io.FileNotFoundException: /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index (No such file or directory)
...

But the file exists on the given slave box:

$ ls -l /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index
-rw-r--r-- 1 root root 1608 Dec  8 19:20 /tmp/blockmgr-14525ef0-22e9-49fb-8e81-dc84e5fba8b2/16/shuffle_0_55_0.index


 stat shuffle_0_55_0.index 
  File: 'shuffle_0_55_0.index'
  Size: 1608        Blocks: 8          IO Block: 4096   regular file
  Device: 801h/2049d    Inode: 1805493     Links: 1
  Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
  Access: 2016-12-08 19:20:38.163188836 +0000
  Modify: 2016-12-08 19:20:38.163188836 +0000
  Change: 2016-12-08 19:20:38.163188836 +0000
  Birth: -
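
When comparing the executor log with the shuffle-service log, it can help to list which block-manager directories each one mentions. The following is a minimal sketch, assuming the logs have been saved to local files and that `grep -o` is available; `blockmgr_dirs` is a hypothetical helper name, and the `/tmp/blockmgr-` prefix is taken from the paths quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print each distinct /tmp/blockmgr-<uuid> directory
# mentioned in the given log files (or on stdin when no file is given).
blockmgr_dirs() {
    cat "$@" | grep -o '/tmp/blockmgr-[0-9a-f-]*' | sort -u
}
```

Running it once per log, for example `blockmgr_dirs executor.log` and `blockmgr_dirs shuffle-service.log` (both file names are placeholders), shows whether the two sides refer to the same directories on a given agent.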

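Best answer

Although I'm not familiar with DCOS, Marathon and Azure, I use dynamic resource allocation (the Mesos external shuffle service) on Mesos and Aurora with Docker.

Does each Mesos agent node have its own external shuffle service (that is, one external shuffle service per Mesos agent)?
Is spark.local.dir set to exactly the same string, pointing to the same directory, on both sides? Your spark.local.dir for the shuffle service is /tmp, but I don't know the DCOS settings.
Is the spark.local.dir directory readable and writable for both? If both the Mesos agent and the external shuffle service are launched by Docker, spark.local.dir on the host must be mounted into both containers.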
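
The readable/writable check above can be scripted. The following is a minimal sketch, assuming a POSIX shell is available on the agent host and inside both containers; `check_local_dir` is a hypothetical helper name, not part of Spark or DC/OS.

```shell
#!/bin/sh
# Hypothetical helper: report whether a directory looks usable as
# spark.local.dir for the current user (exists, readable, writable, searchable).
check_local_dir() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -r "$dir" ] || [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "not readable/writable: $dir"
        return 1
    fi
    echo "ok: $dir"
}
```

Running `check_local_dir /tmp` on the host, in the shuffle-service container, and in an executor container should give the same result everywhere. It only checks permissions; it does not prove that all three see the same files, so listing a known shuffle file from each place is still worthwhile.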

Edit

If the SPARK_LOCAL_DIRS environment variable is set (Mesos or standalone), spark.local.dir is overridden.
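
The precedence described in this edit can be illustrated as follows. This is a simplified sketch that models only the two settings named here; Spark's actual resolution considers further cases, and `effective_local_dirs` is a hypothetical helper, not a Spark command.

```shell
#!/bin/sh
# Simplified model: SPARK_LOCAL_DIRS, when set and non-empty, wins over the
# spark.local.dir value passed as the first argument.
effective_local_dirs() {
    conf_value="$1"   # value of spark.local.dir from the properties file
    if [ -n "${SPARK_LOCAL_DIRS:-}" ]; then
        echo "$SPARK_LOCAL_DIRS"
    else
        echo "$conf_value"
    fi
}
```

Calling `effective_local_dirs /tmp` inside an executor container and inside the shuffle-service container is a quick way to see whether an environment variable is silently replacing the /tmp value from dynamic-allocation.conf in one of them.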

Regarding "azure - How to run Spark + Cassandra + Mesos (DCOS) with dynamic resource allocation?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41054952/
