
pandas - AttributeError: 'StructType' object has no attribute 'encode'

Reposted. Author: 行者123. Updated: 2023-12-03 06:10:00

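I am trying to create a Spark DataFrame from a pandas DataFrame. I am building the schema from StructType and StructField entries, some of which contain arrays. Here is the sample schema: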

mySchema = (
    StructType(
        [
            StructField("country_code", StringType(), True),
            StructField("unit_id", StringType(), True),
            StructField("date", DateType(), True),
            StructField("health_category_car_door", StringType(), True),
            StructField("reason_car", StringType(), True),
            StructField("reason_landing", StringType(), True),
            StructField(
                "reasonDetails_car_door",
                StructType(
                    [
                        StructField(
                            "car_doors",
                            ArrayType(
                                StructType(
                                    [
                                        StructField("opmode", StringType(), True),
                                        StructField("count", IntegerType(), True),
                                        StructField(
                                            "window_length", IntegerType(), True
                                        ),
                                    ]
                                ),
                                True,
                            ),
                            True,
                        ),
                        StructField("landing_doors", StringType(), True),
                    ]
                ),
                True,
            ),
        ]
    ),
    StructField("health_category_landing_door", StringType(), True),
    StructField("num_yellow_preds_in_last_14_days", IntegerType(), True),
    StructField(
        "reasonDetails_landing_door",
        ArrayType(
            StructType(
                [
                    StructField("id", StringType(), True),
                    StructField(
                        "causes",
                        ArrayType(
                            StructType(
                                [
                                    StructField("opmode", StringType(), True),
                                    StructField("count", IntegerType(), True),
                                    StructField("window_length", IntegerType(), True),
                                ]
                            ),
                            True,
                        ),
                        True,
                    ),
                    StructField(
                        "num_yellow_preds_in_last_14_days", IntegerType(), True
                    ),
                ]
            ),
            True,
        ),
    ),
)

sparkDF = spark.createDataFrame(df_new, mySchema)
sparkDF.printSchema()

It raises an error.

/databricks/spark/python/pyspark/sql/session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
938 elif isinstance(schema, (list, tuple)):
939 # Must re-encode any unicode strings to be consistent with StructField names
--> 940 schema = [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]
941
942 try:

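While debugging I understood that the schema needs to be updated as described in the post (Pyspark error on creating dataframe: 'StructField' object has no attribute 'encode'), but I cannot work out how I need to update it. Can anyone advise?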

Best answer

I believe you are creating the pandas DataFrame from this data with df = pd.DataFrame(json.loads(<your_data>)) and then converting it to Spark while supplying the schema.

I tried this and got the same error as you.


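The error occurs because the schema must be a single StructType that contains all the StructField objects.

In your schema, if you look closely, some StructField entries sit outside the StructType. The outer parentheses therefore build a tuple, and createDataFrame treats a list or tuple schema as a list of column names, which is why the traceback shows .encode being called on every element that is not a string.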
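The mechanism can be reproduced without Spark. The sketch below copies the list comprehension shown at line 940 of the traceback and feeds it a tuple. `FakeStructType` and `normalize_column_names` are hypothetical names introduced here only so the snippet runs without PySpark installed; they are not part of the library.

```python
class FakeStructType:
    """Hypothetical stand-in for pyspark.sql.types.StructType (it has no .encode)."""


def normalize_column_names(schema):
    # Same expression as line 940 of session.py in the traceback:
    # every element that is not a str gets .encode("utf-8") called on it.
    return [x.encode("utf-8") if not isinstance(x, str) else x for x in schema]


# Outer parentheses plus commas make the "schema" a tuple, as in the question.
schema_as_tuple = (FakeStructType(), "health_category_landing_door")

try:
    normalize_column_names(schema_as_tuple)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```

The first tuple element is not a string, so the comprehension calls `.encode` on it and fails in the same way as the question's schema.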

You can use the schema below. To arrive at it, I looked at the data in the pandas DataFrame.


Here you can see that landing_doors and car_doors are the row names (the index), while reasonDetails_car_door and reasonDetails_landing_door hold list or array values.
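pandas aligns dict-valued fields on their keys, which is why the keys landing_doors and car_doors end up as the index and each cell holds either a list or a null. A rough way to see the same structure without pandas is to inspect the parsed JSON directly. This is only a sketch: `shape` is a hypothetical helper, and `record` is a trimmed copy of the answer's sample data.

```python
import json

# Trimmed copy of the sample record from the answer (two fields only).
record = json.loads(
    '{"reasonDetails_car_door": {"landing_doors": null,'
    ' "car_doors": [{"opmode": "xxx", "count": 10, "window_length": 1}]},'
    ' "reason_car": "High count"}'
)


def shape(value):
    """Return a short description of a parsed JSON value's structure."""
    if isinstance(value, dict):
        return {key: shape(inner) for key, inner in value.items()}
    if isinstance(value, list):
        # Describe a list by its first element; an empty list stays empty.
        return [shape(value[0])] if value else []
    return type(value).__name__


shapes = {name: shape(value) for name, value in record.items()}
print(shapes["reasonDetails_car_door"])
```

The nested field is a dict keyed by landing_doors and car_doors whose values are a null and a list of records, which matches the array-of-struct column type used in the schema.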

So I modified the schema as follows, tried it, and it worked.

import json

import pandas as pd
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType

# Triple quotes let the JSON literal span several lines.
data = '''{"country_code":"xxx","unit_id":"xxx","date":1691280000000,"health_category_car_door":"xxx",
"num_yellow_preds_in_last_14_days":10,
"reasonDetails_car_door":{"landing_doors":null,"car_doors":[{"opmode":"xxx","count":10,"window_length":1}]},"reason_car":"High count","health_category_landing_door":"xxxx",
"reasonDetails_landing_door":{"car_doors":null,"landing_doors":[{"id":"xx","causes":[{"opmode":"xxx","count":1,"window_length":14},{"opmode":"xxx","count":10,"window_length":1}],"num_yellow_preds_in_last_14_days":1}]},
"reason_landing":"High count."}'''

# Every StructField sits inside the list passed to one StructType.
sc = StructType([
    StructField('country_code', StringType(), True),
    StructField('unit_id', StringType(), True),
    StructField('date', LongType(), True),
    StructField('health_category_car_door', StringType(), True),
    StructField('num_yellow_preds_in_last_14_days', LongType(), True),
    StructField('reasonDetails_car_door',
        ArrayType(
            StructType([
                StructField('count', LongType(), True),
                StructField('opmode', StringType(), True),
                StructField('window_length', LongType(), True)]), True), True),
    StructField('reason_car', StringType(), True),
    StructField('health_category_landing_door', StringType(), True),
    StructField('reasonDetails_landing_door',
        ArrayType(
            StructType([
                StructField('causes',
                    ArrayType(
                        StructType([
                            StructField('count', LongType(), True),
                            StructField('opmode', StringType(), True),
                            StructField('window_length', LongType(), True)]), True), True),
                StructField('id', StringType(), True),
                StructField('num_yellow_preds_in_last_14_days', LongType(), True)]), True), True),
    StructField('reason_landing', StringType(), True)])

json_data = json.loads(data)
# `spark` is assumed to be an existing SparkSession (Databricks notebooks provide one).
sparkDF = spark.createDataFrame(pd.DataFrame(json_data), sc)
sparkDF.printSchema()
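The question declared date as DateType, whereas the schema above uses LongType because the sample value is an integer. If a calendar date is needed, one option is to convert the value before building the DataFrame. The sketch below assumes the integer is milliseconds since the Unix epoch in UTC, which the sample value suggests but the source does not state; `epoch_ms_to_date` is a hypothetical helper.

```python
from datetime import datetime, timezone


def epoch_ms_to_date(epoch_ms):
    """Convert milliseconds since the Unix epoch to a UTC calendar date.

    Assumption: the input is epoch milliseconds in UTC.
    """
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc).date()


sample_date = epoch_ms_to_date(1691280000000)
print(sample_date)  # 2023-08-06
```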


Alternatively, you can create the Spark DataFrame without pandas and supply a schema like the one below.

import json

from pyspark.sql.types import ArrayType, IntegerType, MapType, StringType, StructField, StructType

mySchema = StructType([
    StructField("country_code", StringType(), True),
    StructField("unit_id", StringType(), True),
    StructField("date", StringType(), True),
    StructField("health_category_car_door", StringType(), True),
    StructField("reason_car", StringType(), True),
    StructField("reason_landing", StringType(), True),
    StructField(
        "reasonDetails_car_door",
        # Map keys are "car_doors" / "landing_doors"; values are arrays of structs.
        MapType(StringType(),
            ArrayType(StructType([
                StructField("opmode", StringType(), True),
                StructField("count", IntegerType(), True),
                StructField("window_length", IntegerType(), True)]), True)),
        True),
    StructField("health_category_landing_door", StringType(), True),
    StructField("num_yellow_preds_in_last_14_days", IntegerType(), True),
    StructField(
        "reasonDetails_landing_door",
        MapType(StringType(),
            ArrayType(StructType([
                StructField("id", StringType(), True),
                StructField("causes", ArrayType(StructType([
                    StructField("opmode", StringType(), True),
                    StructField("count", IntegerType(), True),
                    StructField("window_length", IntegerType(), True)]), True), True),
                StructField("num_yellow_preds_in_last_14_days", IntegerType(), True)]), True))),
])

# `data` is the JSON string defined in the previous snippet.
json_data = json.loads(data)
# `spark` is assumed to be an existing SparkSession.
sparkDF = spark.createDataFrame(data=[json_data], schema=mySchema)
display(sparkDF)  # display() is a Databricks notebook helper
sparkDF.printSchema()


Regarding pandas - AttributeError: 'StructType' object has no attribute 'encode', a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76919263/
