
apache-spark - What is the maximum size for a broadcast object in Spark?

Reposted. Author: 行者123. Updated: 2023-12-04 11:10:29

When using the DataFrame broadcast function or the SparkContext broadcast function, what is the maximum object size that can be dispatched to all executors?

Best answer

broadcast function:

The default is 10 MB, but we have used it up to 300 MB. It is controlled by spark.sql.autoBroadcastJoinThreshold.

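AFAIK, it all depends on the memory available, so there is no definite answer to this. What I would say is that the broadcast side should be smaller than the large DataFrame. You can estimate the size of a large or small DataFrame like this: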

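import org.apache.spark.util.SizeEstimator

// estimate returns a Long: the estimated in-memory size, in bytes, of the object passed in.
// logInfo expects a String, so the value is interpolated into a message.
logInfo(s"Estimated size in bytes: ${SizeEstimator.estimate(yourlargeorsmalldataframehere)}")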

Based on this, you can pass a broadcast hint to the framework.
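As a minimal sketch of what passing that hint can look like: the object name, table contents, and column names below are made up for illustration, and the 300 MB value simply mirrors the figure mentioned above. It raises spark.sql.autoBroadcastJoinThreshold and also marks the small side explicitly with the broadcast function.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastHintSketch {
  // Joins a hypothetical "orders" table with a small "countries" lookup table.
  def joinWithHint(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Raise the automatic threshold from the 10 MB default to 300 MB (value is in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 300L * 1024 * 1024)

    // Illustrative data only: 1000 orders, each pointing at one of three countries.
    val orders = (1 to 1000).map(i => (i, i % 3)).toDF("orderId", "countryId")
    val countries = Seq((0, "DE"), (1, "FR"), (2, "IN")).toDF("countryId", "code")

    // Explicit hint: ask Spark to broadcast the small side regardless of its estimated size.
    orders.join(broadcast(countries), Seq("countryId"))
  }

  def main(args: Array[String]): Unit = {
    // Local session purely for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-hint-sketch")
      .getOrCreate()

    // Print the physical plan; with the hint it would normally show a broadcast join.
    joinWithHint(spark).explain()
    spark.stop()
  }
}
```

Checking the plan printed by explain() is a quick way to confirm whether the hint was honored on your Spark version.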

Also have a look at the Scala doc in sql/execution/SparkStrategies.scala, which says:

  • Broadcast: if one side of the join has an estimated physical size that is smaller than the user-configurable [[SQLConf.AUTO_BROADCASTJOIN_THRESHOLD]] threshold or if that side has an explicit broadcast hint (e.g. the user applied the
    [[org.apache.spark.sql.functions.broadcast()]] function to a DataFrame), then that side of the join will be broadcasted and the other side will be streamed, with no shuffling
    performed. If both sides are below the threshold, broadcast the smaller side. If neither is smaller, BHJ is not used.
  • Shuffle hash join: if the average size of a single partition is small enough to build a hash table.
  • Sort merge: if the matching join keys are sortable.
  • If there is no joining keys, Join implementations are chosen with the following precedence:
    • BroadcastNestedLoopJoin: if one side of the join could be broadcasted
    • CartesianProduct: for Inner join
    • BroadcastNestedLoopJoin
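The broadcast rule in the first bullet can be modelled in a few lines of plain Scala. This is only a simplified reading of the quoted comment, not Spark's actual planner code: the names JoinSide, BuildLeft, BuildRight, and chooseBroadcastSide are illustrative, join-type restrictions are ignored, and "smaller than the threshold" is taken literally as a strict comparison.

```scala
object BroadcastSideSketch {
  sealed trait BuildSide
  case object BuildLeft extends BuildSide
  case object BuildRight extends BuildSide

  // One side of a join: its estimated physical size and whether the user hinted it.
  final case class JoinSide(estimatedBytes: Long, hasBroadcastHint: Boolean)

  // Returns the side to broadcast, or None when a broadcast hash join is not applicable.
  def chooseBroadcastSide(left: JoinSide, right: JoinSide, thresholdBytes: Long): Option[BuildSide] = {
    // A side qualifies if it carries an explicit hint or is below the threshold.
    def eligible(s: JoinSide): Boolean =
      s.hasBroadcastHint || s.estimatedBytes < thresholdBytes

    (eligible(left), eligible(right)) match {
      // Both qualify: broadcast the smaller one (ties go to the left side in this sketch).
      case (true, true) =>
        Some(if (left.estimatedBytes <= right.estimatedBytes) BuildLeft else BuildRight)
      case (true, false)  => Some(BuildLeft)
      case (false, true)  => Some(BuildRight)
      case (false, false) => None
    }
  }
}
```

For example, with a 10 MB threshold, a 5 MB table joined to a 1 GB table would broadcast the 5 MB side, while two 1 GB tables without hints would fall through to the other strategies in the list.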


Also have a look at other-configuration-options.

SparkContext.broadcast (TorrentBroadcast):

Broadcast shared variables also have a property, spark.broadcast.blockSize=4M. AFAIK there is no hard limit that I have seen for this either.

For more information, see TorrentBroadcast.scala.
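For the SparkContext side, a minimal sketch of a broadcast shared variable might look like the following. The object name, the lookup map, and the keys are invented for illustration; spark.broadcast.blockSize is set to the 4m value mentioned above only to show where the property goes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastVariableSketch {
  // Broadcasts a small lookup map and sums the values found for a few keys.
  def lookupSum(sc: SparkContext): Long = {
    // Hypothetical lookup table, small enough to ship to every executor.
    val lookup = Map(1 -> 10L, 2 -> 20L, 3 -> 30L)
    val bc = sc.broadcast(lookup)

    // Tasks read the broadcast value instead of capturing the map in each closure.
    val total = sc.parallelize(Seq(1, 2, 3, 2))
      .map(k => bc.value.getOrElse(k, 0L))
      .reduce(_ + _)

    // Release the broadcast data once it is no longer needed.
    bc.destroy()
    total
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("broadcast-variable-sketch")
      // Size of the pieces the broadcast data is split into.
      .set("spark.broadcast.blockSize", "4m")
    val sc = new SparkContext(conf)

    println(lookupSum(sc)) // 10 + 20 + 30 + 20 = 80
    sc.stop()
  }
}
```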

EDIT:

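However, you can have a look at the 2 GB issue, even though it is not officially stated in the docs (I could not find anything of this kind in the docs).
Please see SPARK-6235, which is in the "IN PROGRESS" state, and SPARK-6235_Design_V0.02.pdf.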
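One rough way to act on this is to compare a size estimate against the 2 GB mark before broadcasting. The sketch below assumes the limit corresponds to the largest length a JVM byte array can have (Int.MaxValue); that reading and the names TwoGbLimitSketch and exceedsTwoGb are assumptions for illustration, not something stated in the Spark docs.

```scala
object TwoGbLimitSketch {
  // Assumed ceiling: a JVM byte array is indexed by Int, so at most Int.MaxValue bytes.
  val MaxByteArrayBytes: Long = Int.MaxValue.toLong

  // True when an estimated size (in bytes) is above the assumed 2 GB ceiling.
  def exceedsTwoGb(estimatedBytes: Long): Boolean =
    estimatedBytes > MaxByteArrayBytes
}
```

A value obtained from SizeEstimator.estimate could be passed to exceedsTwoGb as a sanity check, keeping in mind that the estimate is approximate.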

Regarding "apache-spark - What is the maximum size for a broadcast object in Spark?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41045917/
