
python - Expanding an array into columns in a dask dataframe

Reposted · Author: 太空宇宙 · Updated: 2023-11-03 20:17:06

I have avro data with the keys "id, label, features". id and label are strings, while features is a buffer of floats.

import dask.bag as db
import numpy as np
from functools import partial

avros = db.read_avro('data.avro')
df = avros.to_dataframe()
convert = partial(np.frombuffer, dtype='float64')
X = df.assign(features=lambda x: x.features.apply(convert, meta='float64'))

I ended up with this MCVE:

  label id         features
0  good  a  [1.0, 0.0, 0.0]
1   bad  b  [1.0, 0.0, 0.0]
2  good  c  [0.0, 0.0, 0.0]
3   bad  d  [1.0, 0.0, 1.0]
4  good  e  [0.0, 0.0, 0.0]

The output I want is:

  label id   f1   f2   f3
0  good  a  1.0  0.0  0.0
1   bad  b  1.0  0.0  0.0
2  good  c  0.0  0.0  0.0
3   bad  d  1.0  0.0  1.0
4  good  e  0.0  0.0  0.0

I tried a pandas-like approach, i.e. df[['f1','f2','f3']] = df.features.apply(pd.Series), but it does not work in dask the way it does in pandas.
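For reference, the per-row pd.Series expansion does work on a plain pandas DataFrame, which is a quick way to confirm the target shape; a minimal sketch with made-up toy data mirroring the MCVE above:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the MCVE above (values are made up)
df = pd.DataFrame({
    "label": ["good", "bad", "good"],
    "id": ["a", "b", "c"],
    "features": [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
})

# In pandas this per-row expansion works (though it is slow);
# dask.dataframe does not support this multi-column assignment.
df[["f1", "f2", "f3"]] = df["features"].apply(pd.Series)
```

This is only a shape check in pandas, not a dask solution.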

I can get there with a loop like this:

for i in range(n_features):  # n_features: length of each features array
    df[f'f{i}'] = df.features.map(lambda x, i=i: x[i])  # bind i by value, since dask evaluates lazily

But in the real use case I have thousands of features, and this traverses the dataset thousands of times.

What is the best way to achieve the desired result?

Best answer

In [68]: import string
...: import numpy as np
...: import pandas as pd

In [69]: M, N = 100, 100
...: labels = np.random.choice(['good', 'bad'], size=M)
...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
...: features = np.empty((M,), dtype=object)
...: features[:] = list(map(list, np.random.randn(M, N)))
...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
...: df1 = df.copy()

In [70]: %%time
...: columns = [f"f{i:04d}" for i in range(N)]
...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
...: df1 = pd.concat([df1, features], axis=1)
Wall time: 13.9 ms

In [71]: M, N = 1000, 1000
...: labels = np.random.choice(['good', 'bad'], size=M)
...: ids = np.random.choice(list(string.ascii_lowercase), size=M)
...: features = np.empty((M,), dtype=object)
...: features[:] = list(map(list, np.random.randn(M, N)))
...: df = pd.DataFrame([labels, ids, features], index=['label', 'id', 'features']).T
...: df1 = df.copy()

In [72]: %%time
...: columns = [f"f{i:04d}" for i in range(N)]
...: features = pd.DataFrame(list(map(np.asarray, df1.pop('features').to_numpy())), index=df.index, columns=columns)
...: df1 = pd.concat([df1, features], axis=1)
Wall time: 627 ms

In [73]: df1.shape
Out[73]: (1000, 1002)
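Outside an IPython session, that approach can be packaged as a standalone helper; a sketch under the same assumptions, where the function name expand_features is mine, not from the original:

```python
import numpy as np
import pandas as pd

def expand_features(df, col="features", prefix="f"):
    """Pop the array column and concat it back as one float column per element."""
    arrays = df.pop(col).to_numpy()
    n = len(arrays[0])  # assumes every row has the same feature length
    columns = [f"{prefix}{i:04d}" for i in range(n)]
    feats = pd.DataFrame(list(map(np.asarray, arrays)), index=df.index, columns=columns)
    return pd.concat([df, feats], axis=1)

# Tiny made-up example
df = pd.DataFrame({
    "label": ["good", "bad"],
    "id": ["a", "b"],
    "features": [[1.0, 0.0], [0.0, 1.0]],
})
out = expand_features(df.copy())  # pop mutates its argument, so pass a copy
```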

Edit: the approach above is about 2× faster than the original loop, timed below for comparison:

In [79]: df2 = df.copy()

In [80]: %%time
    ...: features = df2.pop('features')
    ...: for i in range(N):
    ...:     df2[f'f{i:04d}'] = features.map(lambda x: x[i])
    ...:
Wall time: 1.46 s

In [81]: df1.equals(df2)
Out[81]: True

Edit: an even faster way to build the DataFrame, about 8× faster than the original loop:

In [22]: df1 = df.copy()

In [23]: %%time
    ...: arr = np.stack(df1.pop('features').to_numpy())
    ...: features = pd.DataFrame({f"f{i:04d}": arr[:, i] for i in range(N)}, index=df1.index)
    ...: df1 = pd.concat([df1, features], axis=1)
Wall time: 165 ms

Regarding python - expanding an array into columns in a dask dataframe, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58384320/
