
pandas - RuntimeError: Unsupported type in conversion to Arrow: VectorUDT

Reposted. Author: 行者123. Updated: 2023-12-03 16:52:03

I want to convert a large Spark DataFrame with more than 1,000,000 rows to pandas. I tried converting the Spark DataFrame to a pandas DataFrame with the following code:

spark.conf.set("spark.sql.execution.arrow.enabled", "true")
result.toPandas()

However, I got this error:
TypeError                                 Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1949 import pyarrow
-> 1950 to_arrow_schema(self.schema)
1951 tables = self._collectAsArrow()

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_schema(schema)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in <listcomp>(.0)
1650 fields = [pa.field(field.name, to_arrow_type(field.dataType), nullable=field.nullable)
-> 1651 for field in schema]
1652 return pa.schema(fields)

/usr/local/lib/python3.6/dist-packages/pyspark/sql/types.py in to_arrow_type(dt)
1641 else:
-> 1642 raise TypeError("Unsupported type in conversion to Arrow: " + str(dt))
1643 return arrow_type

TypeError: Unsupported type in conversion to Arrow: VectorUDT

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)
<ipython-input-138-4e12457ff4d5> in <module>()
1 spark.conf.set("spark.sql.execution.arrow.enabled", "true")
----> 2 result.toPandas()

/usr/local/lib/python3.6/dist-packages/pyspark/sql/dataframe.py in toPandas(self)
1962 "'spark.sql.execution.arrow.enabled' is set to true. Please set it to false "
1963 "to disable this.")
-> 1964 raise RuntimeError("%s\n%s" % (_exception_message(e), msg))
1965 else:
1966 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

RuntimeError: Unsupported type in conversion to Arrow: VectorUDT
Note: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.enabled' is set to true. Please set it to false to disable this.

With Arrow enabled the conversion fails. If I set the Arrow option to false it works, but it is far too slow. Any ideas?

Best answer

Arrow supports only a small set of types, and Spark UserDefinedTypes, including the ml / mllib VectorUDTs, are not among them.

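If you want to use Arrow, you have to convert the data to a supported format first. One possible solution is to expand the Vectors into columns, as described in "How to split Vector into columns - using PySpark".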
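The vector-to-columns approach could be sketched roughly as follows. This is a minimal sketch, not the code from the linked answer: the helper names vector_to_list and expand_vector, the size parameter, and the temporary column name _arr are all hypothetical, and it assumes every vector in the column has the same, known length. It uses a plain Python UDF so it should also work on Spark 2.x; on Spark 3.0 and later, pyspark.ml.functions.vector_to_array provides a built-in conversion to an array column.

```python
def vector_to_list(v):
    """Turn an ml/mllib Vector (dense or sparse) into a list of Python floats."""
    # toArray() densifies sparse vectors; None passes through for null rows.
    return None if v is None else [float(x) for x in v.toArray()]


def expand_vector(df, vector_col, size):
    """Replace `vector_col` with `size` double columns named <vector_col>_<i>.

    Hypothetical helper: assumes every vector has exactly `size` entries
    and that `df` has no existing column called "_arr".
    """
    # Imported here so vector_to_list can be used without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    to_array = udf(vector_to_list, ArrayType(DoubleType()))
    other_cols = [c for c in df.columns if c != vector_col]
    element_cols = [
        col("_arr")[i].alias("%s_%d" % (vector_col, i)) for i in range(size)
    ]
    # Convert once into an array column, then pull each element out by index.
    return df.withColumn("_arr", to_array(col(vector_col))).select(
        *(other_cols + element_cols)
    )
```

With a vector column called, say, features (an assumed name) holding 3-element vectors, expand_vector(result, "features", 3).toPandas() would then only hand Arrow plain double columns.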

You can also serialize the column with the to_json function:

from pyspark.sql.functions import to_json

df.withColumn("your_vector_column", to_json("your_vector_column"))

But if the data is large enough for toPandas to become a serious bottleneck, I would reconsider collecting the data like this at all.
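One way to act on that advice is to shrink the data on the cluster before collecting it. The following is a minimal sketch, assuming a random sample of a few columns is enough for the local analysis; collect_sample and its parameters are hypothetical names, not part of PySpark.

```python
def collect_sample(df, columns, fraction, seed=None):
    """Select `columns`, sample roughly `fraction` of the rows, then collect.

    Hypothetical helper: the projection and the sampling are executed by
    Spark, so only the reduced result is sent to the driver by toPandas().
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Positional (withReplacement, fraction, seed) form of DataFrame.sample.
    return df.select(*columns).sample(False, float(fraction), seed).toPandas()
```

If Arrow is enabled, the selected columns still have to be of Arrow-supported types, so a VectorUDT column would need to be converted or left out first.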

Regarding pandas - RuntimeError: Unsupported type in conversion to Arrow: VectorUDT, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51175500/
