
python - pyarrow.lib.ArrowInvalid : ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type' )

Reposted. Author: 行者123. Updated: 2023-12-03 14:44:55

Using pyarrow to convert a pandas.DataFrame containing Player objects to a pyarrow.Table with the following code

import pandas as pd
import pyarrow as pa

class Player:
    def __init__(self, name, age, gender):
        self.name = name
        self.age = age
        self.gender = gender

    def __repr__(self):
        return f'<{self.name} ({self.age})>'

data = [
    Player('Jack', 21, 'm'),
    Player('Ryan', 18, 'm'),
    Player('Jane', 35, 'f'),
]
df = pd.DataFrame(data, columns=['player'])
print(pa.Table.from_pandas(df))

we get the error:
pyarrow.lib.ArrowInvalid: ('Could not convert <Jack (21)> with type Player: did not recognize Python value type when inferring an Arrow data type', 'Conversion failed for column 0 with type object')

The same error occurs when using
df.to_parquet('players.pq')
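Is it possible for pyarrow to fall back to serializing these Python objects with pickle? Or is there a better solution? The pyarrow.Table will eventually be written to disk using pyarrow.parquet.write_table().
  • Using Python 3.8.0, pandas 0.25.3, pyarrow 0.13.0.
  • pandas.DataFrame.to_parquet() does not support a MultiIndex, so a solution based on pq.write_table(pa.Table.from_pandas(df)) is preferred.

Thanks!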

    Best answer

    My suggestion is to serialize the objects first and insert them into the DataFrame already serialized.
    Best option: use a dataclass (Python >= 3.7)
    Define the Player class as a dataclass with the @dataclass decorator, so that dataclasses.asdict() can turn each instance into a plain dict that serializes directly to JSON.

    import json
    import pandas as pd
    from dataclasses import asdict, dataclass

    @dataclass
    class PlayerV2:
        name: str
        age: int
        gender: str

        def __repr__(self):
            return f'<{self.name} ({self.age})>'


    dataV2 = [
        PlayerV2(name='Jack', age=21, gender='m'),
        PlayerV2(name='Ryan', age=18, gender='m'),
        PlayerV2(name='Jane', age=35, gender='f'),
    ]

    # Serialize each dataclass instance to a JSON string via asdict()
    df_v2 = pd.DataFrame([json.dumps(asdict(p)) for p in dataV2], columns=['player'])
    print(df_v2)

    # The objects' attributes are still available by deserializing a record
    json.loads(df_v2["player"][0])['name']
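If the full object is needed again rather than a single attribute, the JSON string can be turned back into a dataclass instance. The following is a minimal sketch using only the standard library; the helper names `to_json` and `from_json` are hypothetical and not part of the original answer, and it assumes every field holds a JSON-compatible value.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PlayerV2:
    name: str
    age: int
    gender: str


def to_json(player):
    # Hypothetical helper: asdict() gives a plain dict that json can encode.
    return json.dumps(asdict(player))


def from_json(text):
    # Hypothetical helper: rebuild the dataclass from the decoded dict.
    # This relies on the JSON keys matching the dataclass field names.
    return PlayerV2(**json.loads(text))


serialized = to_json(PlayerV2(name='Jack', age=21, gender='m'))
restored = from_json(serialized)

print(serialized)
print(restored == PlayerV2(name='Jack', age=21, gender='m'))
```

Because dataclasses generate `__eq__` from their fields, the restored instance compares equal to the original, which makes the round trip easy to check.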

    Serialize the objects manually (Python < 3.7)
    Define a serialization function in the Player class and serialize each instance before creating the DataFrame.

    import pandas as pd
    import json

    class Player:
        def __init__(self, name, age, gender):
            self.name = name
            self.age = age
            self.gender = gender

        def __repr__(self):
            return f'<{self.name} ({self.age})>'

        # Serialization function for JSON; if for some reason you really need pickle, you can use it instead
        def toJSON(self):
            return json.dumps(self, default=lambda o: o.__dict__)

    # Serialize the objects before inserting them into the DataFrame
    data = [
        Player('Jack', 21, 'm').toJSON(),
        Player('Ryan', 18, 'm').toJSON(),
        Player('Jane', 35, 'f').toJSON(),
    ]
    df = pd.DataFrame(data, columns=['player'])

    # All the data is now stored as serialized JSON strings in the column player
    print(df)

    # The objects' attributes are still available by deserializing a record
    json.loads(df["player"][0])['name']

    Regarding python - pyarrow.lib.ArrowInvalid : ('Could not convert X with type Y: did not recognize Python value type when inferring an Arrow data type' ), the original question can be found on Stack Overflow: https://stackoverflow.com/questions/59636745/
