
pandas - Running df.apply, dask, and pd.get_dummies together

Repost. Author: 行者123. Updated: 2023-12-04 01:26:10

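I have several categorical columns, and together they contain millions of distinct values. I am therefore using dask and pd.get_dummies to convert these categorical columns into bit vectors, like this: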

import pandas as pd
import numpy as np
import scipy.sparse
import dask.dataframe as dd
import multiprocessing

train_set = pd.read_csv('train_set.csv')

def convert_into_one_hot(col1, col2):
    return pd.get_dummies(train_set, columns=[col1, col2], sparse=True)

ddata = dd.from_pandas(train_set, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')

However, I get this error:

ValueError: Metadata inference failed in `lambda`.

You have supplied a custom function and Dask is unable to determine the type of output that that function returns.

To resolve this please provide a meta= keyword.
The docstring of the Dask function you ran should have more information.

Original error is below:
------------------------
KeyError("None of [Index(['foo'], dtype='object')] are in the [columns]")

What am I doing wrong here? Thanks.

Edit:

A small example that reproduces the error. Hopefully it helps in understanding the problem.

def convert_into_one_hot(x, y):
    return pd.get_dummies(df, columns=[x, y], sparse=True)
d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)
dd.from_pandas(df, npartitions=2*multiprocessing.cpu_count()).map_partitions(lambda df: df.apply((lambda row: convert_into_one_hot(row.col1, row.col2)), axis=1)).compute(scheduler='processes')

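Best answer

I think you may run into problems if you try to use get_dummies inside the partitions. Dask has its own version of get_dummies, which should work as follows: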

import pandas as pd
import dask.dataframe as dd
import multiprocessing as mp

d = {'col1': ['a', 'b'], 'col2': ['c', 'd']}
df = pd.DataFrame(data=d)

Pandas

pd.get_dummies(df, columns=["col1", "col2"], sparse=True)

Dask

ddf = dd.from_pandas(df, npartitions=2 * mp.cpu_count())

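# dd.get_dummies needs the columns to have a categorical dtype with known categories
dummies_cols = ["col1", "col2"]
ddf[dummies_cols] = ddf[dummies_cols].categorize()

# this returns a lazy Dask DataFrame; call .compute() to materialize it
dd.get_dummies(ddf, columns=dummies_cols, sparse=True)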
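
To make the two problems with the original approach concrete, here is a minimal pandas-only sketch (no Dask required). It assumes the same toy frame as above; slicing the frame with iloc stands in for Dask partitions, and the helper name encode is hypothetical. First, df.apply(..., axis=1) hands each row's values to get_dummies as if they were column names, which raises a KeyError. The 'foo' in the question's traceback is most likely the placeholder value Dask uses while inferring metadata. Second, even with correct column names, encoding each partition on its own produces different columns per partition unless the categories are fixed up front, which is roughly what categorize() takes care of.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "b"], "col2": ["c", "d"]})

# Problem 1: row values are passed where column names are expected.
row = df.iloc[0]  # row.col1 == "a", row.col2 == "c"
try:
    pd.get_dummies(df, columns=[row.col1, row.col2])
    raised_key_error = False
except KeyError:
    # "a" and "c" are cell values, not columns of df
    raised_key_error = True

# Problem 2: per-partition encoding gives inconsistent columns.
# Slicing stands in for Dask partitions (an assumption for illustration).
part0, part1 = df.iloc[:1], df.iloc[1:]
cols0 = list(pd.get_dummies(part0, columns=["col1", "col2"]).columns)
cols1 = list(pd.get_dummies(part1, columns=["col1", "col2"]).columns)

# Fixing the category list per column makes every partition agree.
cats = {c: sorted(df[c].unique()) for c in ["col1", "col2"]}

def encode(part):
    """Hypothetical helper: one-hot encode a partition with fixed categories."""
    part = part.copy()
    for col, values in cats.items():
        part[col] = pd.Categorical(part[col], categories=values)
    return pd.get_dummies(part, columns=list(cats))

fixed0 = list(encode(part0).columns)
fixed1 = list(encode(part1).columns)

print(raised_key_error)
print(cols0, cols1)
print(fixed0, fixed1)
```

With known categories, both slices yield the same four indicator columns, so the per-partition results can be concatenated safely.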

This post on running df.apply, dask, and pd.get_dummies together is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/61903376/
