
python - Pandas improvement

Reposted. Author: 行者123. Last updated: 2023-12-01 03:18:35

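I currently have a Pandas DataFrame in which I compare columns. I found a case where the columns are empty at comparison time, and for some reason the comparison returns the else value. I added an extra statement to reset those rows to an empty string. I'd like to know whether I can simplify this into a single statement.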

df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''

Code

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a_id': ['A', 'B', 'C', 'D', '', 'F', ''],
    'a_score': [1, 2, 3, 4, '', 6, ''],
    'b_id': ['a', 'b', 'c', 'd', 'e', 'f', ''],
    'b_score': [0.1, 0.2, 3.1, 4.1, 5, 5.99, ''],
})
print(df)
# Strip whitespace from string cells, then replace empty strings with NaN
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)).replace('', np.nan)

# Pick the id with the higher score
df['doc_id'] = df.apply(lambda row: row['a_id'] if row['a_score'] >= row['b_score'] else row['b_id'], axis=1)

# Select type based on higher score
df['doc_type'] = df.apply(lambda row: 'a' if row['a_score'] >= row['b_score'] else 'b', axis=1)
print(df)
# Update type when both ids are empty
df['doc_type'].loc[(df['a_id'].isnull() & df['b_id'].isnull())] = ''
print(df)

Best answer

You can use numpy.where instead of apply. For selecting rows by boolean indexing together with a column, it is better to use this form:

df.loc[mask, 'colname'] = val
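
As a minimal sketch of that pattern, the following uses a small hypothetical two-row frame (not the data from the question). The row mask and the column label go into a single .loc call, so the assignment is applied to df itself rather than to an intermediate object:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the last row has both ids missing.
df = pd.DataFrame({
    'a_id': ['A', np.nan],
    'b_id': ['a', np.nan],
    'doc_type': ['a', 'b'],
})

# Boolean mask: True where both ids are missing.
mask = df['a_id'].isnull() & df['b_id'].isnull()

# One .loc call with row selector and column label together.
df.loc[mask, 'doc_type'] = ''
print(df['doc_type'].tolist())  # ['a', '']
```
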
# Strip whitespace from string cells, then replace empty strings with NaN
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)).replace('', np.nan)

# Pick the id with the higher score
df['doc_id'] = np.where(df['a_score'] >= df['b_score'], df['a_id'], df['b_id'])
# Select type based on higher score
df['doc_type'] = np.where(df['a_score'] >= df['b_score'], 'a', 'b')
print(df)
# Set type to '' when both ids are missing
df.loc[(df['a_id'].isnull() & df['b_id'].isnull()), 'doc_type'] = ''
print(df)
  a_id  a_score b_id  b_score doc_id doc_type
0    A      1.0    a     0.10      A        a
1    B      2.0    b     0.20      B        a
2    C      3.0    c     3.10      c        b
3    D      4.0    d     4.10      d        b
4  NaN      NaN    e     5.00      e        b
5    F      6.0    f     5.99      F        a
6  NaN      NaN  NaN      NaN    NaN
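
The question mentions that the comparison "returns the else value" for empty rows. A likely reason, shown here as a small sketch on hypothetical data, is that any ordering comparison involving NaN evaluates to False, so numpy.where (like the if/else inside apply) falls through to the second value:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame: row 1 has missing scores on both sides.
scores = pd.DataFrame({'a_score': [1.0, np.nan], 'b_score': [0.1, np.nan]})

# NaN >= NaN is False, so row 1 is False in the mask.
mask = scores['a_score'] >= scores['b_score']

# np.where therefore picks the "else" value for row 1.
picked = np.where(mask, 'a', 'b')
print(mask.tolist())    # [True, False]
print(picked.tolist())  # ['a', 'b']
```

This is why an extra step is needed to blank out rows in which both ids are missing.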

An alternative way to build the mask is DataFrame.all with axis=1, which checks whether all values in each row are True:

print (df[['a_id', 'b_id']].isnull())
    a_id   b_id
0  False  False
1  False  False
2  False  False
3  False  False
4   True  False
5  False  False
6   True   True

print (df[['a_id', 'b_id']].isnull().all(axis=1))
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

df.loc[df[['a_id', 'b_id']].isnull().all(axis=1), 'doc_type'] = ''
print (df)
  a_id  a_score b_id  b_score doc_id doc_type
0    A      1.0    a     0.10      A        a
1    B      2.0    b     0.20      B        a
2    C      3.0    c     3.10      c        b
3    D      4.0    d     4.10      d        b
4  NaN      NaN    e     5.00      e        b
5    F      6.0    f     5.99      F        a
6  NaN      NaN  NaN      NaN    NaN

But it is even better to use a nested (double) numpy.where:

# Strip whitespace from string cells, then replace empty strings with NaN
df = df.apply(lambda col: col.map(lambda v: v.strip() if isinstance(v, str) else v)).replace('', np.nan)

# Store the masks as Series so each comparison runs only once
mask = df['a_score'] >= df['b_score']
mask1 = (df['a_id'].isnull() & df['b_id'].isnull())
# Alternative solution for mask1
# mask1 = df[['a_id', 'b_id']].isnull().all(axis=1)
# Pick the id with the higher score
df['doc_id'] = np.where(mask, df['a_id'], df['b_id'])
# Select type based on higher score; '' when both ids are missing
df['doc_type'] = np.where(mask, 'a', np.where(mask1, '', 'b'))
print(df)
  a_id  a_score b_id  b_score doc_id doc_type
0    A      1.0    a     0.10      A        a
1    B      2.0    b     0.20      B        a
2    C      3.0    c     3.10      c        b
3    D      4.0    d     4.10      d        b
4  NaN      NaN    e     5.00      e        b
5    F      6.0    f     5.99      F        a
6  NaN      NaN  NaN      NaN    NaN
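
If the nested numpy.where becomes hard to read, numpy.select is one possible alternative. This is not part of the original answer, only a sketch on a hypothetical four-row frame. Note one difference: conditions are evaluated in list order, so here the "both ids missing" check takes priority over the score comparison, whereas the nested numpy.where above checks the score comparison first.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering the three cases: a wins, b wins, both missing.
df = pd.DataFrame({
    'a_id': ['A', 'C', np.nan, np.nan],
    'a_score': [1.0, 3.0, np.nan, np.nan],
    'b_id': ['a', 'c', 'e', np.nan],
    'b_score': [0.1, 3.1, 5.0, np.nan],
})

both_missing = df[['a_id', 'b_id']].isnull().all(axis=1)
a_wins = df['a_score'] >= df['b_score']

# The first condition that is True decides the value; 'b' is the fallback.
df['doc_type'] = np.select([both_missing, a_wins], ['', 'a'], default='b')
print(df['doc_type'].tolist())  # ['a', 'b', 'b', '']
```
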

Regarding python - Pandas improvement, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42221746/
