
python - Pandas: groupby with comparisons takes a long time to process

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:48:05

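I'm new to pandas, but with help from Stack Overflow I have something working. The approach below gives the right result, but it takes about 30 minutes on a fairly large dataset. Is there a way to speed it up? Essentially I'm trying to tally the various combinations of the "Status" column against the "Current_Status" column. Thanks!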

df_new = df.groupby('id').apply(lambda x: pd.Series(dict(
    # Note: `!= np.nan` is True for every row, so this counts all rows, not only the non-null ones.
    new_col1=(x['foo'] != np.nan).sum(),
    new_col2=(x['bar'] == 'P').sum(),
    new_col3=(x['bar'] == 'C').sum(),
    new_col4=((x['Status'] == 'Approved, not yet paid') & (x['Current_Status'] == 'Approved, paid')).sum(),
    new_col5=((x['Status'] == 'Approved, not yet paid') & (x['Current_Status'] == 'Approved, not yet paid')).sum(),
    new_col6=((x['Status'] == 'Approved, paid') & (x['Current_Status'] == 'Approved, paid')).sum()
)))

Example of the df structure:

In[15]: df.head(6)
Out[15]:
   id  foo  bar                    Status            Current_Status
0   1   23  'C'          'Approved, paid'          'Approved, paid'
1   1   63  'P'  'Approved, not yet paid'          'Approved, paid'
2   1   84  'P'          'Approved, paid'          'Approved, paid'
3   1  125  'P'  'Approved, not yet paid'  'Approved, not yet paid'
4   1  216  'P'  'Approved, not yet paid'          'Approved, paid'
5   1   12  'C'          'Approved, paid'          'Approved, paid'

Best answer

You can try notnull and numpy.in1d:

df_new1 = df.groupby('id').apply(lambda x: pd.Series(dict(
    new_col1=(x['foo'].notnull()).sum(),
    new_col2=np.in1d(x['bar'], 'P').sum(),
    new_col3=np.in1d(x['bar'], 'C').sum(),
    new_col4=(np.in1d(x['Status'], ['Approved, not yet paid']) & np.in1d(x['Current_Status'], ['Approved, paid'])).sum(),
    new_col5=(np.in1d(x['Status'], ['Approved, not yet paid']) & np.in1d(x['Current_Status'], ['Approved, not yet paid'])).sum(),
    new_col6=(np.in1d(x['Status'], ['Approved, paid']) & np.in1d(x['Current_Status'], ['Approved, paid'])).sum()
)))
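
The switch from `!= np.nan` to notnull matters for correctness as well as speed. A minimal sketch of the difference, using a made-up three-element Series (the values are illustrative, not taken from the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
foo = pd.Series([23.0, np.nan, 84.0])

# NaN compares unequal to everything, including itself,
# so this mask is True on every row.
n_naive = int((foo != np.nan).sum())

# notnull() is True only where a value is actually present.
n_valid = int(foo.notnull().sum())

print(n_naive, n_valid)
```

Here the naive comparison counts all three rows, while notnull counts only the two rows that hold a value.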

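Another, faster solution converts the values to 0 and 1 with factorize, builds the inverted columns by subtracting 1 and taking abs, and finally applies groupby with sum. Note that factorize numbers values in order of first appearance, so this only works when each column holds exactly two distinct values and the value being counted is the second one to appear: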

df['new_col1'] = df['foo'].notnull().astype(int)
df['new_col2'] = df['bar'].factorize()[0]
df['new_col3'] = (df['new_col2'] - 1).abs()
df['Status'] = df['Status'].factorize()[0]
df['invertStatus'] = (df['Status'] - 1).abs()
df['Current_Status'] = df['Current_Status'].factorize()[0]
df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()

df['new_col4'] = df['Status'] & df['invertCurrent_Status']
df['new_col5'] = df['Status'] & df['Current_Status']
df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']

print(df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']])
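
Since the factorize approach depends on how the codes are assigned, it may help to see that behaviour in isolation. The sketch below uses two small made-up Series (not the question's data) and also shows an order-independent alternative with Series.eq, which is a substitution on my part rather than part of the answer:

```python
import pandas as pd

# Hypothetical data: 'C' appears first, so it gets code 0 and 'P' gets code 1.
bar = pd.Series(['C', 'P', 'P', 'P', 'C'])
codes, uniques = bar.factorize()

# Same labels, but 'P' appears first, so now 'P' gets code 0.
flipped = pd.Series(['P', 'C', 'C'])
codes_flipped, _ = flipped.factorize()

# An explicit comparison gives 1 for 'P' regardless of row order.
is_p = bar.eq('P').astype(int)

print(list(codes), list(uniques))
print(list(codes_flipped))
print(is_p.tolist())
```

If the row order of the real data is not guaranteed, the explicit comparison is the safer way to build the 0/1 columns.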

Or you can create boolean Series, which is the fastest solution:

df['new_col1'] = df['foo'].notnull()
df['new_col2'] = np.in1d(df['bar'], 'P')
df['new_col3'] = ~df['new_col2']
Status = np.in1d(df['Status'],'Approved, not yet paid')
invertStatus = ~Status
Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
invertCurrent_Status = ~Current_Status

df['new_col4'] = Status & invertCurrent_Status
df['new_col5'] = Status & Current_Status
df['new_col6'] = invertStatus & invertCurrent_Status
#print df

print(df.groupby('id').sum()[['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']].astype(int))

Timings:

In [25]: len(df)
Out[25]: 110000

In [26]: %timeit a(df)
10 loops, best of 3: 24.7 ms per loop

In [27]: %timeit b(df1)
10 loops, best of 3: 39.3 ms per loop

In [28]: %timeit c(df2)
10 loops, best of 3: 46 ms per loop

In [29]: %timeit d(df3)
10 loops, best of 3: 103 ms per loop

Code:

import numpy as np
import pandas as pd

# Repeat the test DataFrame 10000 times to get a benchmark-sized input.
df = pd.concat([df]*10000).reset_index(drop=True)
#print df
df1,df2,df3 = df.copy(), df.copy(), df.copy()

cols = ['new_col1','new_col2','new_col3','new_col4','new_col5','new_col6']

# a: boolean columns built once on the whole frame, then a single groupby sum.
def a(df):
    df['new_col1'] = df['foo'].notnull()
    df['new_col2'] = np.in1d(df['bar'], 'P')
    df['new_col3'] = ~df['new_col2']
    Status = np.in1d(df['Status'],'Approved, not yet paid')
    invertStatus = ~Status
    Current_Status = np.in1d(df['Current_Status'],'Approved, not yet paid')
    invertCurrent_Status = ~Current_Status
    df['new_col4'] = Status & invertCurrent_Status
    df['new_col5'] = Status & Current_Status
    df['new_col6'] = invertStatus & invertCurrent_Status
    #print df
    return df.groupby('id').sum()[cols].astype(int)

# b: 0/1 integer columns via factorize, then a single groupby sum.
def b(df):
    df['new_col1'] = df['foo'].notnull().astype(int)
    df['new_col2'] = df['bar'].factorize()[0]
    df['new_col3'] = (df['new_col2'] - 1).abs()
    df['Status'] = df['Status'].factorize()[0]
    df['invertStatus'] = (df['Status'] - 1).abs()
    df['Current_Status'] = df['Current_Status'].factorize()[0]
    df['invertCurrent_Status'] = (df['Current_Status'] - 1).abs()

    df['new_col4'] = df['Status'] & df['invertCurrent_Status']
    df['new_col5'] = df['Status'] & df['Current_Status']
    df['new_col6'] = df['invertStatus'] & df['invertCurrent_Status']

    return df.groupby('id').sum()[cols]

# c: per-group apply using notnull and np.in1d.
def c(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'].notnull()).sum(),
        new_col2=np.in1d(x['bar'],'P').sum(),
        new_col3=np.in1d(x['bar'],'C').sum(),
        new_col4=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
        new_col5=(np.in1d(x['Status'],['Approved, not yet paid']) & np.in1d(x['Current_Status'],['Approved, not yet paid'])).sum(),
        new_col6=(np.in1d(x['Status'],['Approved, paid']) & np.in1d(x['Current_Status'],['Approved, paid'])).sum(),
    )))

# d: the original per-group apply from the question.
def d(df):
    return df.groupby('id').apply(lambda x: pd.Series(dict(
        new_col1=(x['foo'] != np.nan).sum(),
        new_col2=(x['bar'] == 'P').sum(),
        new_col3=(x['bar'] == 'C').sum(),
        new_col4=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, paid')).sum(),
        new_col5=((x['Status']=='Approved, not yet paid') & (x['Current_Status']=='Approved, not yet paid')).sum(),
        new_col6=((x['Status']=='Approved, paid') & (x['Current_Status']=='Approved, paid')).sum(),
    )))

Test DataFrame:

    id  foo bar                  Status          Current_Status
0    1   23   C          Approved, paid          Approved, paid
1    1   63   P  Approved, not yet paid          Approved, paid
2    1   84   P          Approved, paid          Approved, paid
3    1  125   P  Approved, not yet paid  Approved, not yet paid
4    1   12   C          Approved, paid          Approved, paid
5    2   23   C          Approved, paid          Approved, paid
6    2   63   P  Approved, not yet paid          Approved, paid
7    2   84   P          Approved, paid          Approved, paid
8    2  125   P  Approved, not yet paid  Approved, not yet paid
9    2  216   P  Approved, not yet paid          Approved, paid
10   2   12   C          Approved, paid          Approved, paid
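
For anyone who wants to reproduce the result, the test DataFrame above can be rebuilt in code and run through a variant of the boolean approach. This is a sketch under a few assumptions of mine: the constants NYP and PAID and the helper frame `flags` are names introduced here, every condition is written as an explicit comparison instead of an inversion, and Series.eq stands in for np.in1d (which newer NumPy releases deprecate in favour of np.isin).

```python
import pandas as pd

NYP = 'Approved, not yet paid'   # shorthand introduced for this sketch
PAID = 'Approved, paid'

# The 11-row test DataFrame shown above.
df = pd.DataFrame({
    'id': [1] * 5 + [2] * 6,
    'foo': [23, 63, 84, 125, 12, 23, 63, 84, 125, 216, 12],
    'bar': list('CPPPC') + list('CPPPPC'),
    'Status': [PAID, NYP, PAID, NYP, PAID,
               PAID, NYP, PAID, NYP, NYP, PAID],
    'Current_Status': [PAID, PAID, PAID, NYP, PAID,
                       PAID, PAID, PAID, NYP, PAID, PAID],
})

# One boolean column per condition, computed on the whole frame at once.
flags = pd.DataFrame({
    'new_col1': df['foo'].notnull(),
    'new_col2': df['bar'].eq('P'),
    'new_col3': df['bar'].eq('C'),
    'new_col4': df['Status'].eq(NYP) & df['Current_Status'].eq(PAID),
    'new_col5': df['Status'].eq(NYP) & df['Current_Status'].eq(NYP),
    'new_col6': df['Status'].eq(PAID) & df['Current_Status'].eq(PAID),
})

# Summing booleans per id counts the rows where each condition holds.
result = flags.groupby(df['id']).sum().astype(int)
print(result)
```

Keeping the flags in a separate frame leaves the original df untouched, which is convenient when the same data is reused across several timing runs.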

Regarding "python - Pandas: groupby with comparisons takes a long time to process", the original question is on Stack Overflow: https://stackoverflow.com/questions/36132425/
