
apache-spark - Optimizing a cross join in Spark SQL

Reposted. Author: 行者123. Updated: 2023-12-04 09:16:14

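Is it possible to optimize a cross join in Spark SQL? The requirement is to populate a band_id column based on age ranges defined in another table. So far I have been able to do this with a cross join and a WHERE clause. However, I would like to know whether there is a better way to write this that eases the performance problem. Can I use a broadcast hint? (The SQL is provided below.)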

Customers: (10 million records)

id | name | age
X1 | John | 22
V2 | Mark | 29
F4 | Peter| 42

Age_band table: (10 records)

band_id | low_age | high_age
B123 | 10 | 19
X745 | 20 | 29
P134 | 30 | 39
Q245 | 40 | 50

Expected output:

id | name | age | band_id
X1 | John | 22 | X745
V2 | Mark | 29 | X745
F4 | Peter| 42 | Q245

Query:

select a.id, a.name, a.age, b.band_id
from cust a
cross join age_band b
where a.age between b.low_age and b.high_age;

Please advise.

Best answer

Judging from the SparkStrategies.scala source, it seems that in your case you can, but do not have to, specify either cross or a broadcast hint, because a Broadcast Nested Loop Join is what Spark will choose anyway:

* ...
* - Broadcast nested loop join (BNLJ):
* Supports both equi-joins and non-equi-joins.
* Supports all the join types, but the implementation is optimized for:
* 1) broadcasting the left side in a right outer join;
* 2) broadcasting the right side in a left outer, left semi, left anti or existence join;
* 3) broadcasting either side in an inner-like join.
* For other cases, we need to scan the data multiple times, which can be rather slow.
* ...
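
To make the quoted comment more concrete, here is a minimal plain-Python sketch of the idea behind a broadcast nested loop join, using the sample rows from the question. It is not Spark code and not how Spark implements the operator; the names `Customer`, `AgeBand` and `broadcast_nested_loop_join` are hypothetical and exist only for this illustration. The small table stands in for the broadcast side, and the large table is scanned once against it.

```python
from typing import Iterable, Iterator, NamedTuple, Sequence, Tuple


class Customer(NamedTuple):
    id: str
    name: str
    age: int


class AgeBand(NamedTuple):
    band_id: str
    low_age: int
    high_age: int


def broadcast_nested_loop_join(
    customers: Iterable[Customer], bands: Sequence[AgeBand]
) -> Iterator[Tuple[str, str, int, str]]:
    """Inner join on `age BETWEEN low_age AND high_age` (illustrative only).

    `bands` plays the role of the broadcast side: it is small and held
    entirely in memory, so the large side is streamed through exactly once.
    """
    for c in customers:      # single pass over the large side
        for b in bands:      # inner loop over the small in-memory side
            if b.low_age <= c.age <= b.high_age:  # the non-equi join condition
                yield (c.id, c.name, c.age, b.band_id)


# Sample rows copied from the question.
customers = [
    Customer("X1", "John", 22),
    Customer("V2", "Mark", 29),
    Customer("F4", "Peter", 42),
]
bands = [
    AgeBand("B123", 10, 19),
    AgeBand("X745", 20, 29),
    AgeBand("P134", 30, 39),
    AgeBand("Q245", 40, 50),
]

result = list(broadcast_nested_loop_join(customers, bands))
for row in result:
    print(row)
```

With 10 bands the inner loop is short, so the cost is dominated by the single scan of the customer table. That is why broadcasting the small side suits an inner-like join with a range condition such as this one.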

For more on apache-spark - Optimizing a cross join in Spark SQL, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/63194870/
