
python-3.x - How to broadcast an RDD in PySpark?

Reposted. Author: 行者123. Updated: 2023-12-04 03:06:06

Is it possible to broadcast an RDD in Python?

I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", and in chapter 3 I need to broadcast an RDD. I am trying to follow the examples using Python instead of Scala.

In any case, I get an error even with this simple example:

my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)

The error is:

Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.

I don't quite understand what "action or transformation" the error is referring to.

I am using spark-2.1.1-hadoop2.7.

Important edit: the book is actually correct. I had simply missed that it is not the RDD itself that gets broadcast, but the map version of it obtained via collectAsMap().
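The pattern this edit describes can be sketched as follows: the pair RDD is first collected into a plain Python dict with collectAsMap(), and that dict is what gets broadcast. This is a hedged sketch, not the book's literal code; it tries pyspark and falls back to plain Python when no Spark/JVM is available, so only the collect-then-broadcast shape is being demonstrated:

```python
# Sketch of the pattern: broadcast the *collected* dict, never the RDD itself.
try:
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "collectAsMap-demo")
    pairs_rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
    local_map = pairs_rdd.collectAsMap()   # driver-side dict: {"a": 1, ...}
    bc_map = sc.broadcast(local_map)       # OK: a plain dict is broadcastable
    # Workers read the shared read-only copy via .value inside closures:
    looked_up = sc.parallelize(["a", "c"]).map(lambda k: bc_map.value[k]).collect()
    sc.stop()
except Exception:  # pyspark or a JVM is not available: same logic, locally
    local_map = {"a": 1, "b": 2, "c": 3}
    looked_up = [local_map[k] for k in ["a", "c"]]

print(looked_up)  # [1, 3]
```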

Thanks!

Best Answer

Is it possible to broadcast an RDD in Python?

TL;DR No.

Once you consider what an RDD really is, you'll see that it is simply not possible. There is nothing in an RDD to broadcast. It is too fragile (so to speak).

An RDD is a data structure that describes a distributed computation over some dataset. Through the features of an RDD you describe what to compute and how. It is an abstract entity.

Quoting the scaladoc of RDD:

Represents an immutable, partitioned collection of elements that can be operated on in parallel

Internally, each RDD is characterized by five main properties:

  • A list of partitions

  • A function for computing each split

  • A list of dependencies on other RDDs

  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)

  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

There is not much there you could broadcast (quoting the scaladoc of the SparkContext.broadcast method):

broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T]

Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.

You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.

From Broadcast Variables:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.

And later in the same document:

This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

You can, however, collect the dataset an RDD holds and broadcast it as follows:

my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect())  # <-- collect the dataset

At the "collect the dataset" step, the dataset leaves RDD space and becomes a locally available collection, a Python value, which can then be broadcast.
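Put end to end, the corrected version of the question's snippet looks like this. Again a sketch with a plain-Python fallback when no Spark/JVM is available; the membership test in the filter is just an illustrative use of the broadcast value, not part of the original question:

```python
# Collect the RDD into a local Python list first, then broadcast that list.
try:
    from pyspark import SparkContext

    sc = SparkContext("local[1]", "collect-then-broadcast")
    my_list_rdd = sc.parallelize(["a", "d", "c", "b"])
    bc_list = sc.broadcast(my_list_rdd.collect())  # a plain list, not an RDD
    # Every task can now check membership against the shared read-only copy:
    hits = (sc.parallelize(["a", "x", "c"])
              .filter(lambda v: v in bc_list.value)
              .collect())
    sc.stop()
except Exception:  # no pyspark/JVM available: the same logic on local lists
    shared = ["a", "d", "c", "b"]
    hits = [v for v in ["a", "x", "c"] if v in shared]

print(hits)  # ['a', 'c']
```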

Regarding "python-3.x - How to broadcast an RDD in PySpark?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44216637/
