
apache-spark - Spark SQL broadcast hash join

Reposted · Author: 行者123 · Updated: 2023-12-03 07:19:10

I am trying to perform a broadcast hash join on DataFrames using SparkSQL, as documented here: https://docs.cloud.databricks.com/docs/latest/databricks_guide/06%20Spark%20SQL%20%26%20DataFrames/05%20BroadcastHashJoin%20-%20scala.html

In that example, the (small) DataFrame is persisted via saveAsTable and then joined via Spark SQL (i.e. via sqlContext.sql("...")).

The problem I have is that I need to use the SparkSQL API to construct my SQL (I am joining ~50 tables with an ID list, and don't want to write the SQL by hand).

How do I tell Spark to use the broadcast hash join via the API? The issue is that if I load the ID list (from the table persisted via `saveAsTable`) into a `DataFrame` to use in the join, it isn't clear to me whether Spark can apply the broadcast hash join.

Best Answer

You can explicitly mark a DataFrame as small enough to be broadcast using the broadcast function:

Python:

from pyspark.sql.functions import broadcast

small_df = ...
large_df = ...

large_df.join(broadcast(small_df), ["foo"])

Or with the broadcast hint (Spark >= 2.2):

large_df.join(small_df.hint("broadcast"), ["foo"])

Scala:

import org.apache.spark.sql.functions.broadcast

val smallDF: DataFrame = ???
val largeDF: DataFrame = ???

largeDF.join(broadcast(smallDF), Seq("foo"))

Or with the broadcast hint (Spark >= 2.2):

largeDF.join(smallDF.hint("broadcast"), Seq("foo"))

SQL

You can use hints (Spark >= 2.2):

SELECT /*+ MAPJOIN(small) */ * 
FROM large JOIN small
ON large.foo = small.foo

SELECT /*+ BROADCASTJOIN(small) */ *
FROM large JOIN small
ON large.foo = small.foo

SELECT /*+ BROADCAST(small) */ * 
FROM large JOIN small
ON large.foo = small.foo

R (SparkR):

With a hint (Spark >= 2.2):

join(large, hint(small, "broadcast"), large$foo == small$foo)

With broadcast (Spark >= 2.3):

join(large, broadcast(small), large$foo == small$foo)

Note:

A broadcast join is useful when one of the relations is relatively small. Otherwise it can be significantly more expensive than a full shuffle join.

Regarding apache-spark - Spark SQL broadcast hash join, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37487318/
