apache-spark - Joining PySpark DataFrames on nested fields


I want to perform a join between these two PySpark DataFrames:

from pyspark import SparkContext
from pyspark.sql import Row  # Row was used below but not imported
from pyspark.sql.functions import col

sc = SparkContext()

df1 = sc.parallelize([
    ['owner1', 'obj1', 0.5],
    ['owner1', 'obj1', 0.2],
    ['owner2', 'obj2', 0.1]
]).toDF(('owner', 'object', 'score'))

df2 = sc.parallelize([
    Row(owner=u'owner1',
        objects=[Row(name=u'obj1', value=Row(fav=True, ratio=0.3))])
]).toDF()

The join has to be performed on the name of the object, i.e. the name field inside objects in df2, and object in df1.

I am able to run a SELECT on the nested field, like:

df2.where(df2.owner == 'owner1').select(col("objects.value.ratio")).show()

but I am not able to run this join:

df2.alias('u').join(df1.alias('s'), col('u.objects.name') == col('s.object'))

which returns the error:

pyspark.sql.utils.AnalysisException: u"cannot resolve '(objects.name = cast(object as double))' due to data type mismatch: differing types in '(objects.name = cast(object as double))' (array and double).;"
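The mismatch arises because objects is an array of structs, so objects.name resolves to an array of names rather than a single string. A plain-Python sketch of the shapes involved (with hypothetical stand-in values mirroring the DataFrames above) makes this visible:

```python
# Plain-Python stand-in for the single df2 row: "objects" is an array
# of structs, modeled here as a list of dicts (hypothetical values).
df2_row = {"owner": "owner1",
           "objects": [{"name": "obj1",
                        "value": {"fav": True, "ratio": 0.3}}]}

# "objects.name" extracts name from *each* element of the array,
# so it yields a list of names, not one string.
names = [o["name"] for o in df2_row["objects"]]
print(names)            # ['obj1'] -- an array, not a scalar

# Comparing that array to a scalar column value is the type mismatch
# Spark reports: array vs. scalar.
print(names == "obj1")  # False: a list never equals a string
```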



Any ideas how to solve this?

Best Answer

Since you want to match and extract specific elements, the simplest approach is to explode the rows:

from pyspark.sql.functions import col, explode

matches = df2.withColumn("object", explode(col("objects"))).alias("u").join(
    df1.alias("s"),
    col("s.object") == col("u.object.name")
)

matches.show()
## +-------------------+------+-----------------+------+------+-----+
## | objects| owner| object| owner|object|score|
## +-------------------+------+-----------------+------+------+-----+
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1| obj1| 0.5|
## |[[obj1,[true,0.3]]]|owner1|[obj1,[true,0.3]]|owner1| obj1| 0.2|
## +-------------------+------+-----------------+------+------+-----+
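What explode buys you can be sketched in plain Python (hypothetical stand-in data mirroring the question's DataFrames): each array element becomes its own row, so the join condition compares scalar to scalar:

```python
# Plain-Python sketch of explode + equi-join (hypothetical data
# mirroring df1 and df2 from the question).
rows1 = [("owner1", "obj1", 0.5),
         ("owner1", "obj1", 0.2),
         ("owner2", "obj2", 0.1)]
rows2 = [("owner1", [{"name": "obj1",
                      "value": {"fav": True, "ratio": 0.3}}])]

# explode: one output row per element of the objects array
exploded = [(owner, obj) for owner, objects in rows2 for obj in objects]

# equi-join on the now-scalar name field
matched = [(u, s) for u in exploded for s in rows1
           if u[1]["name"] == s[1]]
print(len(matched))  # 2, matching the two rows shown above
```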

An alternative, but very inefficient, approach is to use array_contains:

from pyspark.sql.functions import expr

matches_contains = df1.alias("s").join(
    df2.alias("u"), expr("array_contains(objects.name, object)"))
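In plain-Python terms, array_contains is a membership test that has to be evaluated for every (df1 row, df2 row) pair, which is why Spark cannot turn it into an equi-join (a sketch with hypothetical stand-in data):

```python
# Plain-Python view of the array_contains join (hypothetical data
# mirroring the question): a membership test checked per pair.
rows1 = [("owner1", "obj1", 0.5),
         ("owner1", "obj1", 0.2),
         ("owner2", "obj2", 0.1)]
rows2 = [("owner1", ["obj1"])]   # (owner, objects.name)

pairs_checked = 0
matched = []
for s in rows1:
    for u in rows2:
        pairs_checked += 1
        if s[1] in u[1]:         # array_contains(objects.name, object)
            matched.append((s, u))

print(pairs_checked)  # 3: every (df1, df2) pair is examined
print(len(matched))   # 2
```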

It is inefficient because it expands to a Cartesian product:

matches_contains.explain()
## == Physical Plan ==
## Filter array_contains(objects#6.name,object#4)
## +- CartesianProduct
## :- Scan ExistingRDD[owner#3,object#4,score#5]
## +- Scan ExistingRDD[objects#6,owner#7]

If the size of the array is relatively small, it is possible to generate an optimized version of array_contains, as I showed here: Filter by whether column value equals a list in spark

This question about joining PySpark DataFrames on nested fields is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/36576196/
