
python - Pandas DataFrame: to_dict() poor performance

Reposted. Author: 行者123. Updated: 2023-12-01 01:15:37

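I use an API that returns large pandas DataFrames. I don't know of a fast way to iterate over a DataFrame directly, so I convert it to a dictionary with to_dict().

Once the data is in dictionary form, performance is fine. However, the to_dict() call itself is often the bottleneck.

I often group the DataFrame's columns together to form a MultiIndex and call to_dict() with the 'index' orientation. I'm not sure whether the large MultiIndex is what causes the poor performance.

Is there a faster way to convert a pandas DataFrame? Or is there a better way to iterate over the DataFrame directly, without any conversion? I'm not sure whether vectorization can be applied here.

Below is sample code that reproduces the timing problem: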

import pandas as pd
import random as rd
import time

# Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)

# Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

# Iterate over all elements in pivot table
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))


# Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

# Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))

Thanks!

Best answer

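The usual guidance is not to iterate at all: apply functions to whole rows and columns, or to grouped rows/columns. If you do need to iterate, the third timed block in the code below shows how to loop over the underlying numpy array, which the .values attribute exposes. The results are:

Pivot Construction takes: 0.012315988540649414

Dataframe iteration takes: 0.32346272468566895

Iteration over values takes: 0.004369020462036133

Cast to dictionary takes: 0.023524761199951172

Iteration over dictionary takes: 0.0010480880737304688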

# Test data
import pandas as pd
import random as rd
import time

# Given a dataframe from api (model as random numbers)
df_columns = ['A','B','C','D','F','G','H','I']
dict_origin = {col:[rd.randint(0,10) for x in range(0,1000)] for col in df_columns}
dict_origin = pd.DataFrame(dict_origin)

# Transform to pivot table
t0 = time.time()
df_pivot = pd.pivot_table(dict_origin,values=df_columns[-3:],index=df_columns[:-3])
t1 = time.time()
print('Pivot Construction takes: ' + str(t1-t0))

# Iterate over all elements in pivot table (label-based lookup per cell, slow)
t0 = time.time()
for column in df_pivot.columns:
    for row in df_pivot[column].index:
        test = df_pivot[column].loc[row]
t1 = time.time()
print('Dataframe iteration takes: ' + str(t1-t0))

# Iterate over all values in pivot table (positional access on the numpy array)
t0 = time.time()
v = df_pivot.values
for row in range(df_pivot.shape[0]):
    for column in range(df_pivot.shape[1]):
        test = v[row, column]
t1 = time.time()
print('Iteration over values takes: ' + str(t1-t0))


# Iteration over dataframe too slow. Cast to dictionary (bottleneck)
t0 = time.time()
df_pivot = df_pivot.to_dict('index')
t1 = time.time()
print('Cast to dictionary takes: ' + str(t1-t0))

# Iteration over dictionary is much faster
t0 = time.time()
for row in df_pivot.keys():
    for column in df_pivot[row]:
        test = df_pivot[row][column]
t1 = time.time()
print('Iteration over dictionary takes: ' + str(t1-t0))
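
The "don't iterate" advice can be made concrete with a small sketch. It is not part of the original answer: the seeded numpy generator and the helper name pivot_to_index_dict are assumptions introduced here for illustration. The first part replaces the cell-by-cell loop with whole-column operations; the second part builds the same nested dictionary as to_dict('index') from the underlying array, for cases where a dictionary is still required. Whether the helper is actually faster than to_dict() depends on the pandas version and the data, so it is worth timing on your own frames.

```python
import numpy as np
import pandas as pd

# Same shape of data as in the question; a seeded generator keeps runs repeatable.
rng = np.random.default_rng(0)
df_columns = ['A', 'B', 'C', 'D', 'F', 'G', 'H', 'I']
df = pd.DataFrame({col: rng.integers(0, 11, size=1000) for col in df_columns})
df_pivot = pd.pivot_table(df, values=df_columns[-3:], index=df_columns[:-3])

# 1) Vectorized: operate on whole columns/rows instead of visiting each cell.
column_means = df_pivot.mean()       # one value per value column
row_totals = df_pivot.sum(axis=1)    # one value per MultiIndex row


# 2) If a nested dict is still needed, build it from the underlying array.
def pivot_to_index_dict(frame):
    """Hypothetical helper: same layout as frame.to_dict('index')."""
    columns = list(frame.columns)
    rows = frame.to_numpy().tolist()  # nested lists of plain Python floats
    # Iterating a MultiIndex yields one tuple of labels per row.
    return {key: dict(zip(columns, row)) for key, row in zip(frame.index, rows)}


as_dict = pivot_to_index_dict(df_pivot)
```

Each key of as_dict is a tuple of the five index labels, and each value is a small dict mapping 'G', 'H' and 'I' to the aggregated value, so code written against the to_dict('index') result should work unchanged.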

Regarding python - Pandas DataFrame: to_dict() poor performance, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54381559/
