
pyspark - spark join raises "Detected cartesian product for INNER join"


I have a DataFrame, and for each row I want to add new_col = max(some_column0), grouped by some other column1:

from pyspark.sql.functions import max  # Spark's aggregate max, shadows the Python built-in

maxs = df0.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0.join(maxs, df0.catalog == maxs.catalogid).take(4)

On the second line I get an error:

AnalysisException: u'Detected cartesian product for INNER join between logical plans\nProject ... Use the CROSS JOIN syntax to allow cartesian products between these relations.;'



What I don't understand is: why does Spark detect a Cartesian product here?

One possible way to work around this error: save the DataFrame to a Hive table, then re-initialize the DataFrame by selecting from that table; or replace these two lines with a Hive query. Either way the problem goes away, but I don't want to save the DataFrame.
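Saving and re-reading helps because it gives the DataFrame a fresh lineage, so the two sides of the join no longer share one. A minimal sketch of that workaround, assuming a Hive-enabled SparkSession named spark and a hypothetical scratch table name tmp_df0:

from pyspark.sql.functions import max

# Persist and re-read: the reloaded DataFrame has a fresh lineage, so Catalyst
# no longer sees the join condition as trivially equal.
df0.write.mode("overwrite").saveAsTable("tmp_df0")
df0_reloaded = spark.table("tmp_df0")

maxs = df0_reloaded.groupBy("catalog").agg(max("row_num").alias("max_num")).withColumnRenamed("catalog", "catalogid")
df0_reloaded.join(maxs, df0_reloaded.catalog == maxs.catalogid).take(4)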

Best Answer

As explained in Why does spark think this is a cross/cartesian join, this can be caused by the following:

This happens because you join structures sharing the same lineage and this leads to a trivially equal condition.



As for how the Cartesian product arises, you can refer to Identifying and Eliminating the Dreaded Cartesian Product.
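In this particular case the groupBy-then-self-join pattern can be avoided altogether with a window function, which sidesteps the shared-lineage problem. A minimal sketch, assuming df0 has the catalog and row_num columns from the question:

from pyspark.sql import Window
from pyspark.sql.functions import max

# Compute max(row_num) per catalog directly, without joining df0 back to itself.
w = Window.partitionBy("catalog")
df0.withColumn("max_num", max("row_num").over(w)).take(4)

Alternatively, Spark 2.x lets you allow the plan explicitly with spark.conf.set("spark.sql.crossJoin.enabled", "true"), as the error message hints, though that merely silences the check rather than fixing the join condition.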

Regarding pyspark - spark join raises "Detected cartesian product for INNER join", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42154476/
