
python - Improving the performance of a pandas DataFrame

Reposted · Author: 行者123 · Updated: 2023-12-01 08:03:11

I am trying to encode the personId values. First I build a dictionary that stores the personId values, then I write those values into a new column. Processing about 70K rows of data takes roughly 25 minutes.

Dataset: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop

interactions_df = pd.read_csv('./users_interactions.csv')

personId_map = {}
personId_len = range(0, len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]

Then I run:

%%time

def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k, v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df.head(3)

Elapsed time:

CPU times: user 25min 46s, sys: 1.58 s, total: 25min 48s
Wall time: 25min 50s

How can the code above be optimized?
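For context, the cost comes from scanning `personId_map.values()` and rebuilding the key list on every row, which is quadratic overall. A minimal sketch (on hypothetical sample data standing in for the CSV) that inverts the dictionary once and applies it in a single vectorized pass with `Series.map`:

```python
import pandas as pd

# Hypothetical sample data standing in for users_interactions.csv
interactions_df = pd.DataFrame({'personId': [101, 205, 101, 307]})

# Build the index -> personId map the same way as in the question
personId_map = dict(enumerate(set(interactions_df['personId'])))

# Invert it once (personId -> index), then map every row in one pass
reverse_map = {v: k for k, v in personId_map.items()}
interactions_df['new_personId'] = interactions_df['personId'].map(reverse_map)
```

The dictionary inversion happens once instead of once per row, so the per-row work drops to a hash lookup.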

Best answer

If there is no special ordering requirement, use factorize:

interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
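To illustrate what `factorize` returns, a quick sketch on a small hypothetical series: it yields an integer code per row, numbered in order of first appearance, plus the array of unique values.

```python
import pandas as pd

s = pd.Series(['b', 'a', 'b', 'c'])
codes, uniques = pd.factorize(s)

print(codes.tolist())    # [0, 1, 0, 2] - one integer code per row, in order of first appearance
print(uniques.tolist())  # ['b', 'a', 'c']
```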

If the dictionary is also needed:

i, v = pd.factorize(interactions_df.personId)
personId_map = dict(zip(i, v[i]))

Data - the first 20 rows, used for testing:

interactions_df = pd.read_csv('./users_interactions.csv', nrows=20, usecols=['personId'])

#print (interactions_df)

personId_map = {}
personId_len = range(0, len(set(interactions_df['personId'])))

for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]

#print (personId_map)

def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k, v in personId_map.items() if v == row['personId']][0])

interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df['new_personId1'] = pd.factorize(interactions_df.personId)[0]
print (interactions_df)
personId new_personId new_personId1
0 -8845298781299428018 3 0
1 -1032019229384696495 5 1
2 -1130272294246983140 9 2
3 344280948527967603 6 3
4 -445337111692715325 0 4
5 -8763398617720485024 10 5
6 3609194402293569455 4 6
7 4254153380739593270 8 7
8 344280948527967603 6 3
9 3609194402293569455 4 6
10 3609194402293569455 4 6
11 1908339160857512799 11 8
12 1908339160857512799 11 8
13 1908339160857512799 11 8
14 7781822014935525018 1 9
15 8239286975497580612 2 10
16 8239286975497580612 2 10
17 -445337111692715325 0 4
18 2766187446275090740 7 11
19 1908339160857512799 11 8
i, v = pd.factorize(interactions_df.personId)
d = dict(zip(i, v[i]))
print (d)
{0: -8845298781299428018, 1: -1032019229384696495, 2: -1130272294246983140,
3: 344280948527967603, 4: -445337111692715325, 5: -8763398617720485024,
6: 3609194402293569455, 7: 4254153380739593270, 8: 1908339160857512799,
9: 7781822014935525018, 10: 8239286975497580612, 11: 2766187446275090740}

Performance:

interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])

#print (interactions_df)

In [243]: %timeit interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
2.03 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
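As a side note, an equivalent integer encoding can also be obtained through pandas categoricals. `cat.codes` numbers categories in sorted order, which matches `factorize(..., sort=True)` rather than the default order-of-appearance codes; a sketch on hypothetical data:

```python
import pandas as pd

s = pd.Series([300, 100, 300, 200])

# sort=True numbers codes by sorted unique values, like a categorical does
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
cat_codes = s.astype('category').cat.codes

print(codes_sorted.tolist())  # [2, 0, 2, 1]
print(cat_codes.tolist())     # [2, 0, 2, 1]
```

Which variant to prefer mostly depends on whether the code order matters downstream; both are vectorized and far faster than a per-row `apply`.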

Regarding "python - Improving the performance of a pandas DataFrame", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55649690/
