
performance - Spark example program runs very slowly

Reposted. Author: 行者123. Last updated: 2023-12-04 15:09:24

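I am trying to use Spark to solve a simple graph problem. I found an example program in the Spark source folder, transitive_closure.py, which computes the transitive closure of a graph with no more than 200 edges and vertices. On my own laptop, however, it ran for more than 10 minutes without terminating. The command line I used was: spark-submit transitive_closure.py.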

Why is Spark so slow even when computing such a small transitive closure? Is this a common situation? Is there some configuration I am missing?

The program is shown below. It can be found in the Spark install folder downloaded from their website.

from __future__ import print_function

import sys
from random import Random

from pyspark import SparkContext

numEdges = 200
numVertices = 100
rand = Random(42)


def generateGraph():
    edges = set()
    while len(edges) < numEdges:
        src = rand.randrange(0, numEdges)
        dst = rand.randrange(0, numEdges)
        if src != dst:
            edges.add((src, dst))
    return edges


if __name__ == "__main__":
    """
    Usage: transitive_closure [partitions]
    """
    sc = SparkContext(appName="PythonTransitiveClosure")
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    tc = sc.parallelize(generateGraph(), partitions).cache()

    # Linear transitive closure: each round grows paths by one edge,
    # by joining the graph's edges with the already-discovered paths.
    # e.g. join the path (y, z) from the TC with the edge (x, y) from
    # the graph to obtain the path (x, z).

    # Because join() joins on keys, the edges are stored in reversed order.
    edges = tc.map(lambda x_y: (x_y[1], x_y[0]))

    oldCount = 0
    nextCount = tc.count()
    while True:
        oldCount = nextCount
        # Perform the join, obtaining an RDD of (y, (z, x)) pairs,
        # then project the result to obtain the new (x, z) paths.
        new_edges = tc.join(edges).map(lambda __a_b: (__a_b[1][1], __a_b[1][0]))
        tc = tc.union(new_edges).distinct().cache()
        nextCount = tc.count()
        if nextCount == oldCount:
            break

    print("TC has %i edges" % tc.count())

    sc.stop()

Best answer

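There can be many reasons why this code performs poorly on your machine, but most likely it is just another variant of the problem described in "Spark iteration time increasing exponentially when using join". The simplest way to check whether that is the case is to pass the spark.default.parallelism parameter on submit: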

bin/spark-submit --conf spark.default.parallelism=2 \
examples/src/main/python/transitive_closure.py

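Unless limited otherwise, SparkContext.union, RDD.join and RDD.union set the number of partitions of the child to the total number of partitions of the parents. Usually this is the desired behavior, but it can become extremely inefficient when applied iteratively.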
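To make the effect concrete, here is a minimal sketch that models only the partition counts of tc across loop iterations. It does not run Spark. The function simulate_partitions and its parameters are hypothetical names introduced for illustration, and the model assumes the rule stated above: a shuffle (join, distinct) uses spark.default.parallelism when it is set and otherwise inherits the partition count of its input, while union always sums the partitions of its parents.

```python
def simulate_partitions(tc_parts, edge_parts, iterations, default_parallelism=None):
    """Model the partition count of `tc` after each loop iteration.

    Assumption (hypothetical model, not Spark itself): join and distinct
    use `default_parallelism` when it is set, otherwise they keep the
    partition count of their input; union sums its parents' partitions.
    """
    history = []
    for _ in range(iterations):
        # tc.join(edges): both sides are combined before the shuffle,
        # so the uncapped result has tc_parts + edge_parts partitions.
        joined = default_parallelism or (tc_parts + edge_parts)
        # tc.union(new_edges): partitions of both parents are added up.
        unioned = tc_parts + joined
        # .distinct(): another shuffle, capped only if a default is set.
        tc_parts = default_parallelism or unioned
        history.append(tc_parts)
    return history


if __name__ == "__main__":
    # Without a cap the count roughly doubles every iteration.
    print(simulate_partitions(2, 2, 10))
    # With spark.default.parallelism=2 the count stays constant.
    print(simulate_partitions(2, 2, 10, default_parallelism=2))
```

Under these assumptions, starting from 2 partitions, the uncapped run reaches 4094 partitions after 10 iterations, while the capped run stays at 2. On a laptop, scheduling thousands of near-empty tasks per stage would plausibly dominate the runtime for a graph this small.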

Regarding "performance - Spark example program runs very slowly", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35566029/
