
Python - Speeding up the conversion of categorical variables to their numeric indices

Reposted. Author: 太空狗. Updated: 2023-10-29 22:25:28

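I need to convert a column of categorical variables in a Pandas DataFrame into numeric values, where each value is the index of that label in the array of unique labels in the column (long story short!). Here is a code snippet that does this: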

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
uniq_lab = np.unique(df['col'])

for lab in uniq_lab:
    df['col'].replace(lab,np.where(uniq_lab == lab)[0][0].astype(float),inplace=True)

This converts the DataFrame:

    col
0 baked
1 beans
2 baked
3 baked
4 beans

into the DataFrame:

    col
0 0.0
1 1.0
2 0.0
3 0.0
4 1.0

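as desired. My problem is that when I run similar code on a large data file, my silly little for loop (the only approach I could think of) is as slow as molasses. I am wondering whether anyone has ideas for doing this more efficiently. Thanks in advance for any thoughts.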

Best answer

Use factorize:

df['col'] = pd.factorize(df.col)[0]
print(df)
   col
0 0
1 1
2 0
3 0
4 1

Docs: pandas.factorize
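
One detail the one-liner hides: factorize returns both the codes and the unique labels, and by default it numbers labels in order of first appearance, whereas the np.unique approach in the question numbers them in sorted order. The two happen to agree for the sample data. The sketch below uses a reordered version of the sample column (an assumption made purely for illustration) to show the difference and how sort=True restores the sorted numbering:

```python
import pandas as pd

# Illustrative data: same labels as the question, but "beans" appears first.
df = pd.DataFrame({'col': ["beans", "baked", "beans", "baked", "baked"]})

# Default: codes follow order of first appearance ("beans" -> 0, "baked" -> 1).
codes, uniques = pd.factorize(df['col'])

# sort=True: codes follow the sorted order of the labels, as np.unique does.
sorted_codes, sorted_uniques = pd.factorize(df['col'], sort=True)

# The uniques index maps codes back to the original labels.
restored = uniques.take(codes)

print(codes, list(uniques))
print(sorted_codes, list(sorted_uniques))
```

If float values are needed, as in the question's expected output, the codes can be cast afterwards with codes.astype(float).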

Edit:

As Jeff mentioned in the comments, it is better to convert the column to categorical, mainly because it uses less memory:

df['col'] = df['col'].astype("category")
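
Note that after this conversion the column still displays the original labels; the integer indices live in the .cat.codes accessor. A minimal sketch, assuming a 5,000-row version of the sample column (the size is chosen only for illustration), that reads the codes and compares memory usage:

```python
import pandas as pd

# Illustrative data: the sample column repeated 1000 times (5,000 rows).
df = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"] * 1000})

as_object = df['col']
as_category = df['col'].astype("category")

# .cat.codes holds each value's integer index into .cat.categories.
codes = as_category.cat.codes
categories = as_category.cat.categories

# deep=True counts the string payload as well as the per-row storage.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(codes.head().tolist(), list(categories))
print(object_bytes, category_bytes)
```

The exact byte counts depend on the pandas version and string storage, so they are printed rather than quoted here.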

Timings:

Interestingly, on a large df pandas is faster than numpy. I can hardly believe it.

len(df)=500k:

In [29]: %timeit (a(df1))
100 loops, best of 3: 9.27 ms per loop

In [30]: %timeit (a1(df2))
100 loops, best of 3: 9.32 ms per loop

In [31]: %timeit (b(df3))
10 loops, best of 3: 24.6 ms per loop

In [32]: %timeit (b1(df4))
10 loops, best of 3: 24.6 ms per loop

len(df)=5k:

In [38]: %timeit (a(df1))
1000 loops, best of 3: 274 µs per loop

In [39]: %timeit (a1(df2))
The slowest run took 6.71 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 273 µs per loop

In [40]: %timeit (b(df3))
The slowest run took 5.15 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 295 µs per loop

In [41]: %timeit (b1(df4))
1000 loops, best of 3: 294 µs per loop

len(df)=5:

In [46]: %timeit (a(df1))
1000 loops, best of 3: 206 µs per loop

In [47]: %timeit (a1(df2))
1000 loops, best of 3: 204 µs per loop

In [48]: %timeit (b(df3))
The slowest run took 6.30 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

In [49]: %timeit (b1(df4))
The slowest run took 6.44 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 164 µs per loop

Test code:

import pandas as pd
import numpy as np

d = {'col': ["baked","beans","baked","baked","beans"]}
df = pd.DataFrame(data=d)
print(df)
# 5 rows * 100000 = 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
#test for 5k
#df = pd.concat([df]*1000).reset_index(drop=True)


# each function gets its own copy because it overwrites 'col' in place
df1,df2,df3, df4 = df.copy(),df.copy(),df.copy(),df.copy()

# a, a1: pandas factorize (codes are the first element of the result)
def a(df):
    df['col'] = pd.factorize(df.col)[0]
    return df

def a1(df):
    idx,_ = pd.factorize(df.col)
    df['col'] = idx
    return df

# b, b1: numpy unique (inverse indices are the second element of the result)
def b(df):
    df['col'] = np.unique(df['col'],return_inverse=True)[1]
    return df

def b1(df):
    _,idx = np.unique(df['col'],return_inverse=True)
    df['col'] = idx
    return df

print(a(df1))
print(a1(df2))
print(b(df3))
print(b1(df4))
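
The timings above were taken with IPython's %timeit. Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names with_factorize and with_unique are hypothetical, introduced only for this example, and the functions return the codes instead of overwriting the column so that every run sees the same string input:

```python
import timeit

import numpy as np
import pandas as pd

base = pd.DataFrame({'col': ["baked", "beans", "baked", "baked", "beans"]})
big = pd.concat([base] * 1000).reset_index(drop=True)  # 5k rows

def with_factorize(col):
    # hypothetical helper: pandas route
    return pd.factorize(col)[0]

def with_unique(col):
    # hypothetical helper: numpy route
    return np.unique(col, return_inverse=True)[1]

# Both routes agree here because the labels first appear in sorted order.
same = np.array_equal(with_factorize(big['col']), with_unique(big['col']))

# repeat=3, number=100 mirrors "100 loops, best of 3".
t_factorize = min(timeit.repeat(lambda: with_factorize(big['col']),
                                repeat=3, number=100)) / 100
t_unique = min(timeit.repeat(lambda: with_unique(big['col']),
                             repeat=3, number=100)) / 100

print(same)
print(f"factorize: {t_factorize * 1e6:.0f} µs per loop")
print(f"np.unique: {t_unique * 1e6:.0f} µs per loop")
```

Absolute numbers will differ by machine and by library version, so the figures quoted earlier should be read as relative rather than reproducible values.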

Regarding "Python - Speeding up the conversion of categorical variables to their numeric indices", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37672704/
