
python - Pandas memory error when creating new columns with apply() and a custom function

Reposted. Author: 行者123. Updated: 2023-12-01 06:26:47

Function that computes the average log(1+TPM) of 2 replicates

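import numpy as np

def average_TPM(a, b):
    # log(1 + TPM) of each replicate
    log_a = np.log(1 + a)
    log_b = np.log(1 + b)
    # average only when both replicates exceed the 0.1 threshold
    if log_a > 0.1 and log_b > 0.1:
        avg = np.mean([log_a, log_b])
    else:
        avg = np.nan
    return avg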

Applying the function to df to create the new columns

df.loc[:,'leaf'] = df.apply(lambda row:  average_TPM(row['leaf1'],row['leaf2']),axis=1)
df.loc[:,'flag_leaf'] = df.apply(lambda row: average_TPM(row['flag_leaf1'],row['flag_leaf2']),axis=1)
df.loc[:,'anther'] = df.apply(lambda row: average_TPM(row['anther1'],row['anther2']),axis=1)
df.loc[:,'premeiotic'] = df.apply(lambda row: average_TPM(row['premeiotic1'],row['premeiotic2']),axis=1)
df.loc[:,'leptotene'] = df.apply(lambda row: average_TPM(row['leptotene1'],row['leptotene2']),axis=1)
df.loc[:,'zygotene'] = df.apply(lambda row: average_TPM(row['zygotene1'],row['zygotene2']),axis=1)
df.loc[:,'pachytene'] = df.apply(lambda row: average_TPM(row['pachytene1'],row['pachytene2']),axis=1)
df.loc[:,'diplotene'] = df.apply(lambda row: average_TPM(row['diplotene1'],row['diplotene2']),axis=1)
df.loc[:,'metaphase_I'] = df.apply(lambda row: average_TPM(row['metaphaseI_1'],row['metaphaseI_2']),axis=1)
df.loc[:,'metaphase_II'] = df.apply(lambda row: average_TPM(row['metaphaseII_1'],row['metaphaseII_2']),axis=1)
df.loc[:,'pollen'] = df.apply(lambda row: average_TPM(row['pollen1'],row['pollen2']),axis=1)

Best answer

I'm not sure why you get a memory error, but you can vectorize your problem:

import numpy as np
import pandas as pd

# dummy data
np.random.seed(2)  # call the function; assigning np.random.seed = 2 would overwrite it and seed nothing
df = pd.DataFrame(np.random.random(8*4).reshape(8, -1), columns=['a1', 'a2', 'b1', 'b2'])
print(df)
# example output from an unseeded run; your values will differ
         a1        a2        b1        b2
0  0.416493  0.964483  0.089547  0.218952
1  0.655331  0.468490  0.272494  0.652915
2  0.680433  0.461191  0.919223  0.552074
3  0.077158  0.138839  0.385818  0.462848
4  0.149198  0.912372  0.893708  0.081125
5  0.255422  0.143502  0.466123  0.524544
6  0.842095  0.486603  0.628405  0.686393
7  0.329461  0.714052  0.176126  0.566491

Define the list of columns to create, then use np.log1p to take log(1+x) of all the data in one call:

col_create = ['a', 'b']  # what you need to redefine for your problem
col_get = [f'{col}{i}' for col in col_create for i in range(1, 3)]  # to ensure the order of columns
arr_log = np.log1p(df[col_get].to_numpy())
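To make the column ordering concrete, here is a small sketch of what col_get holds and how the even and odd column slices of arr_log line up with the two replicates. The frame values are made up for illustration, and the names rep1 and rep2 are not from the original answer.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
df = pd.DataFrame({'a1': [0.5, 0.05], 'a2': [0.7, 0.9],
                   'b1': [0.2, 0.3], 'b2': [0.4, 0.01]})

col_create = ['a', 'b']
col_get = [f'{col}{i}' for col in col_create for i in range(1, 3)]
# col_get is ['a1', 'a2', 'b1', 'b2']: the replicates of one sample sit side by side

arr_log = np.log1p(df[col_get].to_numpy())
rep1 = arr_log[:, ::2]   # columns 0 and 2, i.e. a1 and b1
rep2 = arr_log[:, 1::2]  # columns 1 and 3, i.e. a2 and b2
```

Because the replicates are interleaved, rep1 and rep2 have the same shape and column j in each belongs to col_create[j].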

Now you can use np.where with vectorized comparisons to assign the new columns:

df = df.assign(**pd.DataFrame(
    np.where((arr_log[:, ::2] > 0.1) & (arr_log[:, 1::2] > 0.1),
             (arr_log[:, ::2] + arr_log[:, 1::2]) / 2., np.nan),
    columns=col_create, index=df.index))
print(df)
# example output from a separate random draw, so a1..b2 differ from the frame printed earlier
         a1        a2        b1        b2         a         b
0  0.533141  0.695231  0.909976  0.441877  0.477569  0.506518
1  0.961887  0.872382  0.064593  0.030619  0.650559       NaN
2  0.646332  0.912140  0.615057  0.354700  0.573386  0.391475
3  0.019646  0.926524  0.160417  0.676512       NaN  0.332748
4  0.249448  0.474937  0.349048  0.390213  0.305659  0.314428
5  0.046568  0.985072  0.147037  0.161261       NaN  0.143344
6  0.812421  0.750128  0.861377  0.765981  0.577176  0.595012
7  0.950178  0.397550  0.803165  0.156186  0.501321  0.367335
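The same idea can be carried over to the column names in the question. The sketch below is one possible adaptation, not the answerer's code: the helper average_log_tpm and the pairs mapping are hypothetical names, the TPM values are made up, and only three of the question's eleven pairs are listed. An explicit mapping is used because names such as metaphaseI_1 do not follow the "name1 / name2" pattern that the f-string relies on. Whether this resolves the memory error is not something the original confirms.

```python
import numpy as np
import pandas as pd

def average_log_tpm(df, pairs, threshold=0.1):
    """Return one averaged log(1+TPM) column per replicate pair (hypothetical helper)."""
    first = np.log1p(df[[p[0] for p in pairs.values()]].to_numpy())
    second = np.log1p(df[[p[1] for p in pairs.values()]].to_numpy())
    # Keep the mean only where both replicates exceed the threshold
    keep = (first > threshold) & (second > threshold)
    avg = np.where(keep, (first + second) / 2.0, np.nan)
    return pd.DataFrame(avg, columns=list(pairs), index=df.index)

# New column name -> (replicate 1, replicate 2); extend with the remaining pairs
pairs = {
    'leaf': ('leaf1', 'leaf2'),
    'flag_leaf': ('flag_leaf1', 'flag_leaf2'),
    'metaphase_I': ('metaphaseI_1', 'metaphaseI_2'),
}

# Made-up TPM values, for illustration only
df = pd.DataFrame({
    'leaf1': [3.0, 0.05], 'leaf2': [5.0, 2.0],
    'flag_leaf1': [1.0, 4.0], 'flag_leaf2': [0.0, 6.0],
    'metaphaseI_1': [2.0, 2.0], 'metaphaseI_2': [2.0, 8.0],
})
df = df.assign(**average_log_tpm(df, pairs))
```

Each new column is computed in whole-array operations instead of one Python call per row, and a pair yields NaN whenever either replicate has log(1+TPM) at or below the threshold, matching the scalar average_TPM.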

Regarding python - Pandas memory error when creating new columns with apply() and a custom function, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60100733/
