
python - Compare column names in two data frames in pyspark

Reposted. Author: 行者123. Updated: 2023-12-02 00:47:26

I have two data frames in pyspark, df and data. Their schemas are shown below.

>>> df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- Date: timestamp (nullable = false)
|-- ZipCode: integer (nullable = true)
|-- car: string (nullable = true)
|-- van: string (nullable = true)

>>> data.printSchema()
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- address: string (nullable = true)
|-- nation: string (nullable = true)
|-- date: string (nullable = true)
|-- zipcode: integer (nullable = true)

Now I want to add the columns car and van to my data data frame by comparing the schemas.

If the columns are the same, I also want to compare the two data frames; if the columns differ, the missing columns should be added to the data frame that lacks them.

How can we achieve this in pyspark?

FYI, I am using spark 1.6.

Once the columns are added, their values in the data frame that received them should be null.

For example, here we are adding columns to the data data frame, so the columns car and van in data should contain null values, while the same columns in df should keep their original values.

What happens if there are more than two new columns to be added?

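Best answer

Since the schema is nothing but a StructType consisting of a list of StructFields, we can retrieve the list of fields, compare them, and find the missing columns: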

from pyspark.sql.functions import lit

df_schema = df.schema.fields
data_schema = data.schema.fields
# Compare names case-insensitively (Date vs. date, ZipCode vs. zipcode)
df_names = [x.name.lower() for x in df_schema]
data_names = [x.name.lower() for x in data_schema]
if df_schema != data_schema:
    # Names that occur in exactly one of the two data frames
    col_diff = set(df_names) ^ set(data_names)
    col_list = [(x.name, x.dataType) for x in df_schema + data_schema if x.name.lower() in col_diff]
    for name, data_type in col_list:
        if name.lower() in df_names:
            # Present only in df: add it to data as a typed null column
            data = data.withColumn(name, lit(None).cast(data_type))
        else:
            # Present only in data: add it to df as a typed null column
            df = df.withColumn(name, lit(None).cast(data_type))
else:
    print("Nothing to do")
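
To see which columns this comparison picks up, and that it is not limited to two of them, here is a minimal plain-Python sketch that runs without Spark. The helper name plan_missing_columns is hypothetical, and the (name, type) pairs are only stand-ins for [(f.name, f.dataType) for f in df.schema.fields]; the type strings are taken from the printSchema() output in the question.

```python
def plan_missing_columns(left_fields, right_fields):
    """Return (add_to_left, add_to_right) as lists of (name, type) pairs.

    Each argument is a list of (name, type) pairs, a plain-Python stand-in
    for the fields of a Spark schema. Names are compared case-insensitively.
    """
    left_names = {name.lower() for name, _ in left_fields}
    right_names = {name.lower() for name, _ in right_fields}
    # Columns the left side lacks come from the right side, and vice versa.
    add_to_left = [(n, t) for n, t in right_fields if n.lower() not in left_names]
    add_to_right = [(n, t) for n, t in left_fields if n.lower() not in right_names]
    return add_to_left, add_to_right


# Schemas from the question, written as (name, type) pairs.
df_fields = [("id", "integer"), ("name", "string"), ("address", "string"),
             ("nation", "string"), ("Date", "timestamp"), ("ZipCode", "integer"),
             ("car", "string"), ("van", "string")]
data_fields = [("id", "integer"), ("name", "string"), ("address", "string"),
               ("nation", "string"), ("date", "string"), ("zipcode", "integer")]

add_to_df, add_to_data = plan_missing_columns(df_fields, data_fields)
print(add_to_df)    # []
print(add_to_data)  # [('car', 'string'), ('van', 'string')]
```

Each pair in add_to_data would then be applied with data.withColumn(name, lit(None).cast(type)), as in the answer. Because the result is a list, any number of missing columns on either side is handled the same way.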

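You mentioned adding the column only if it has no null values, but the columns that differ between your schemas are nullable, so I have not used that check. If you need it, add a check on nullable as shown below: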

col_list = [(x.name, x.dataType) for x in df_schema + data_schema if x.name.lower() in col_diff and not x.nullable]
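
The effect of the nullable check can be illustrated without Spark. In this rough sketch, the Field namedtuple is a hypothetical stand-in for pyspark.sql.types.StructField (only name, dataType and nullable are modelled), and the schemas are trimmed to the id, car and van columns from the question.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.sql.types.StructField
Field = namedtuple("Field", ["name", "dataType", "nullable"])

# Trimmed versions of the schemas in the question
df_schema = [Field("id", "integer", False),
             Field("car", "string", True),
             Field("van", "string", True)]
data_schema = [Field("id", "integer", True)]

# Names that occur in exactly one of the two schemas
col_diff = {f.name.lower() for f in df_schema} ^ {f.name.lower() for f in data_schema}

without_check = [(f.name, f.dataType) for f in df_schema + data_schema
                 if f.name.lower() in col_diff]
with_check = [(f.name, f.dataType) for f in df_schema + data_schema
              if f.name.lower() in col_diff and not f.nullable]

print(without_check)  # [('car', 'string'), ('van', 'string')]
print(with_check)     # []
```

Since car and van are both nullable in df, the stricter filter selects nothing for these schemas, which is why the check was left out of the main snippet.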

For more information on StructType and StructField, see the documentation: https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.types.StructType

Regarding python - compare column names in two data frames in pyspark, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42703133/
