
python-3.x - How to broadcast an RDD in PySpark?

Reposted. Author: 行者123. Updated: 2023-12-04 03:06:06

Is it possible to broadcast an RDD in Python?

I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", and in chapter 3 I need to broadcast an RDD. I am trying to follow the examples using Python instead of Scala.

In any case, I get an error even with this simple example:

my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)

The error is:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I don't quite understand what "action or transformation" the error is referring to.

I am using spark-2.1.1-hadoop2.7.

Important edit: the book is actually correct. I had simply missed that it is not the RDD itself that gets broadcast, but the map version of it obtained via collectAsMap().
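The pattern this edit describes can be sketched as follows: the pair RDD is first collected into a plain Python dict with collectAsMap(), and that dict is what gets broadcast. This is a hedged sketch, not the book's literal code; it tries pyspark and falls back to plain Python when no Spark/JVM is available, so only the collect-then-broadcast shape is being demonstrated:

```python
# Sketch of the pattern: broadcast the *collected* dict, never the RDD itself.
try:
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "collectAsMap-demo")
    pairs_rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
    local_map = pairs_rdd.collectAsMap()   # driver-side dict: {"a": 1, ...}
    bc_map = sc.broadcast(local_map)       # OK: a plain dict is broadcastable
    # Workers read the shared read-only copy via .value inside closures:
    looked_up = sc.parallelize(["a", "c"]).map(lambda k: bc_map.value[k]).collect()
    sc.stop()
except Exception:  # pyspark or a JVM is not available: same logic, locally
    local_map = {"a": 1, "b": 2, "c": 3}
    looked_up = [local_map[k] for k in ["a", "c"]]

print(looked_up)  # [1, 3]
```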

Thanks!

Best Answer

Is it possible to broadcast an RDD in Python?

TL;DR No.

Once you consider what an RDD really is, you'll see that it is simply not possible. There is nothing in an RDD to broadcast. It is too fragile (so to speak).

An RDD is a data structure that describes a distributed computation over some dataset. Through the features of an RDD you describe what to compute and how. It is an abstract entity.

Quoting the scaladoc of RDD:

Represents an immutable, partitioned collection of elements that can be operated on in parallel

Internally, each RDD is characterized by five main properties:

  • A list of partitions

  • A function for computing each split

  • A list of dependencies on other RDDs

  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

There is not much there you could broadcast (quoting the scaladoc of the SparkContext.broadcast method):

broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T]

Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.

You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.

From Broadcast Variables:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

And later in the same document:

This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

You can, however, collect the dataset an RDD holds and broadcast it as follows:

my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect())  # <-- collect the dataset

At the "collect the dataset" step, the dataset leaves RDD space and becomes a locally available collection, a Python value, which can then be broadcast.
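Put end to end, the corrected version of the question's snippet looks like this. Again a sketch with a plain-Python fallback when no Spark/JVM is available; the membership test in the filter is just an illustrative use of the broadcast value, not part of the original question:

```python
# Collect the RDD into a local Python list first, then broadcast that list.
try:
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "collect-then-broadcast")
    my_list_rdd = sc.parallelize(["a", "d", "c", "b"])
    bc_list = sc.broadcast(my_list_rdd.collect())  # a plain list, not an RDD
    # Every task can now check membership against the shared read-only copy:
    hits = (sc.parallelize(["a", "x", "c"])
              .filter(lambda v: v in bc_list.value)
              .collect())
    sc.stop()
except Exception:  # no pyspark/JVM available: the same logic on local lists
    shared = ["a", "d", "c", "b"]
    hits = [v for v in ["a", "x", "c"] if v in shared]

print(hits)  # ['a', 'c']
```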

Regarding "python-3.x - How to broadcast an RDD in PySpark?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44216637/
