
python - Getting TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))


I am creating a Spark session as shown below (Spark version 2.2.1):

from pyspark.sql import SparkSession

SparkS = SparkSession.builder\
    .appName("Test")\
    .master("local[*]")\
    .getOrCreate()

Then I use the SparkContext to read the file into an RDD, like this:

raw_data = SparkS\
.sparkContext\
.textFile("C:\\Users\\...\\RawData\\nasdaq.csv")

For verification, I print the data with:

print(raw_data.take(3))

The output is:

['43084,6871.549805,6945.819824,6871.450195,6936.580078,6936.580078,3510420000', '43087,6980.399902,7003.890137,6975.540039,6994.759766,6994.759766,2144360000', '43088,6991.25,6995.879883,6951.490234,6963.850098, 6963.850098,2071060000']

Now I convert the RDD to a DataFrame by defining the following schema:

from pyspark.sql.types import StructType, StringType

schema = StructType().add("date", StringType())\
    .add("open", StringType())\
    .add("high", StringType())\
    .add("low", StringType())\
    .add("close", StringType())\
    .add("adj_close", StringType())\
    .add("volume", StringType())

geioIP = SparkS.createDataFrame(raw_data, schema)
print(geioIP)

The output is:

DataFrame[date: string, open: string, high: string, low: string, close: string, adj_close: string, volume: string]

So far so good, but the problem is that when I call geioIP.show(2), it gives me an error:

18/01/23 12:58:48 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 177, in main
File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 172, in process
File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\serializers.py", line 268, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "C:\Users\rajnish.kumar\AppData\Local\Programs\Python\Python36\lib\site-packages\pyspark\sql\session.py", line 520, in prepare
verify_func(obj, schema)
File "C:\spark-2.2.1-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\sql\types.py", line 1371, in _verify_type
raise TypeError("StructType can not accept object %r in type %s" % (obj, type(obj)))
TypeError: StructType can not accept object '43084,6871.549805,6945.819824,6871.450195,6936.580078,6936.580078,3510420000' in type <class 'str'>

After going through this link, what I did was convert all the CSV data to text format, but I am still running into the problem above.

Best Answer

The problem is that each row in the RDD is a single string (i.e. one column), while your schema contains 7 columns. The RDD is not actually converted to a DataFrame until you invoke an action (such as show), which is why it does not crash immediately.
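As a minimal plain-Python sketch of the mismatch (the sample row is copied from the take(3) output above):

# One element of raw_data, as returned by take(3) above:
row = '43084,6871.549805,6945.819824,6871.450195,6936.580078,6936.580078,3510420000'

# What createDataFrame receives per element: one string, which the
# schema verifier rejects because the StructType declares 7 fields.
print(type(row))        # <class 'str'>

# What the schema needs per element: a sequence of 7 values.
print(row.split(','))   # ['43084', '6871.549805', ..., '3510420000']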

Since you want the data in a DataFrame anyway, the simplest solution is to read it as a DataFrame from the start:

geioIP = SparkS.read.csv("C:\\Users\\...\\RawData\\nasdaq.csv", schema=schema)
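
With the data read this way, each CSV field maps to its own column, so actions should now work as expected:

geioIP.show(2)
geioIP.printSchema()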

Alternatively, if you want to keep using the RDD and createDataFrame, you can use split (and strip, in case there are spaces):

raw_data = raw_data.map(lambda x: [c.strip() for c in x.split(',')])
geioIP = SparkS.createDataFrame(raw_data, schema)
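
As an optional follow-up beyond the original answer (a sketch, not part of the accepted fix): since the schema declares every column as StringType, you may want to cast the numeric columns afterwards, for example:

from pyspark.sql.functions import col

# Cast string columns to numeric types once the DataFrame exists.
typed = geioIP.withColumn("open", col("open").cast("double"))\
              .withColumn("volume", col("volume").cast("long"))
typed.show(2)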

Regarding python - Getting TypeError("StructType can not accept object %r in type %s" % (obj, type(obj))), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/48396550/
