
apache-spark - How to check the intersection of two DataFrame columns in Spark

Reposted. Author: 行者123. Updated: 2023-12-01 09:48:09

Using pyspark or sparkr (preferably both), how can I get the intersection of two DataFrame columns? For example, in sparkr I have the following DataFrames:

newHires <- data.frame(name = c("Thomas", "George", "George", "John"),
surname = c("Smith", "Williams", "Brown", "Taylor"))
salesTeam <- data.frame(name = c("Lucas", "Bill", "George"),
surname = c("Martin", "Clark", "Williams"))
newHiresDF <- createDataFrame(newHires)
salesTeamDF <- createDataFrame(salesTeam)

#Intersect works for the entire DataFrames
newSalesHire <- intersect(newHiresDF, salesTeamDF)
head(newSalesHire)

name surname
1 George Williams

#Intersect does not work for single columns
newSalesHire <- intersect(newHiresDF$name, salesTeamDF$name)
head(newSalesHire)

Error in as.vector(y) : no method for coercing this S4 class to a vector



How can I get intersect to work for single columns?

Best answer

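The intersect function operates on two Spark DataFrames, not on Column objects. Use the select function to project the column you need from each DataFrame, then intersect the two single-column DataFrames.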

In SparkR:

newSalesHire <- intersect(select(newHiresDF, 'name'), select(salesTeamDF,'name'))

In pyspark:
newSalesHire = newHiresDF.select('name').intersect(salesTeamDF.select('name')) 

Regarding apache-spark - How to check the intersection of two DataFrame columns in Spark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44168379/
