gpt4 book ai didi

join - pyspark 左外连接多列

转载 作者:行者123 更新时间:2023-12-02 01:13:47 28 4
gpt4 key购买 nike

我正在使用 Pyspark 2.1.0。

我正在尝试使用以下方法执行两个数据帧的左外连接:我有 2 个数据框,其架构如下所示:

crimes
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- CRIME_TYPE: string (nullable = true)
|-- CURRENT_OUTCOME: string (nullable = true)

outcomes
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- FINAL_OUTCOME: string (nullable = true)

我需要能够根据左外侧将犯罪与结果结合起来,因为单个犯罪存在许多结果。我想排除两个数据框共有的列。

我尝试了以下两种方法,但每种方法都会产生各种错误:

cr_outs = crimes.join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')\
.select(['crimes.'+c for c in crimes.columns] + ['outcomes.FINAL_OUTCOME'])

from pyspark.sql.functions as fn
cr_outs = crimes.alias('a').join(outcomes.alias('b'), fn.col('b.CRIME_ID') = fn.col('a.CRIME_ID') ,'left_outer')\
.select([fn.col('a.'+ c) for c in a.columns] + b.FINAL_OUTCOME)

有人可以建议另一种方法吗?谢谢

最佳答案

这成功了,看来你必须使用别名,类似于posted before ,虽然在 PySpark 2.1.0 中稍微简单一些。

cr_outs = crimes.alias('a')\
.join(outcomes, crimes.CRIME_ID == outcomes.CRIME_ID, 'left_outer')\
.select(*[col('a.'+c) for c in crimes.columns]
+ [outcomes.FINAL_OUTCOME])

cr_outs.show()
cr_outs.printSchema()

--------+-------------------+--------------------+--------------------+--------------------+
| CRIME_ID|YEAR_MTH| REPORTED_BY| FALLS_WITHIN|LONGITUDE| LATITUDE| LOCATION|LSOA_CODE| LSOA_NAME| CRIME_TYPE| CURRENT_OUTCOME| FINAL_OUTCOME|
+--------------------+--------+--------------------+--------------------+---------+---------+--------------------+---------+-------------------+--------------------+--------------------+--------------------+
|426085c2ed33af598...| 2017-01|City of London Po...|City of London Po...|-0.086051| 51.51357|On or near Finch ...|E01032739|City of London 001F| Other theft|Investigation com...|Investigation com...|
|33a3ddb8160a854a4...| 2017-01|City of London Po...|City of London Po...|-0.077777|51.518047|On or near Sandy'...|E01032
..
..
..
root
|-- CRIME_ID: string (nullable = true)
|-- YEAR_MTH: string (nullable = true)
|-- REPORTED_BY: string (nullable = true)
|-- FALLS_WITHIN: string (nullable = true)
|-- LONGITUDE: float (nullable = true)
|-- LATITUDE: float (nullable = true)
|-- LOCATION: string (nullable = true)
|-- LSOA_CODE: string (nullable = true)
|-- LSOA_NAME: string (nullable = true)
|-- CRIME_TYPE: string (nullable = true)
|-- CURRENT_OUTCOME: string (nullable = true)
|-- FINAL_OUTCOME: string (nullable = true)

如您所见,栏目比我原来的帖子多得多,但没有重复的栏目,也没有重命名栏目:-)

关于join - pyspark 左外连接多列,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43642676/

28 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com