
python - pandas df: filtering on partial string values across multiple columns

Reposted. Author: 行者123. Last updated: 2023-12-01 09:16:53

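I recently started using pandas and have run into a (probably trivial) problem that I can't solve. It would be easier to do this in a plain Python script, but I would really like to do it in pandas. Here is my newbie question.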

Given the following DataFrame:

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1,str2,str3@num val str1,str2,str3@num val str1,str2,str3@num
ID2 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num
ID3 val str1,str2,str3@num val str1,str1,str3@num val str4,str2,str3@num
ID4 val str1,str2,str3@num val str2,str2,str3@num val str1,str2,str3@num
ID5 val str4,str5,str63@num val str4,str5,st63@num val str4,str5,str63@num

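I want to write a function that keeps only the rows that have a minimum required quality score in n columns. Only the first part of each string really matters, so the first step is to select just that part: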

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID1 val str1 val str1 val str1
ID2 val str4 val str4 val str4
ID3 val str1 val str1 val str4
ID4 val str1 val str2 val str1
ID5 val str4 val str4 val str4

Suppose I only want to keep the rows with a minimum score of "str4" in two columns (I might instead compute a percentage across columns):

 ID Sample1 quality1    Sample2 quality2    Sample3 quality3
ID2 val str4 val str4 val str4
ID5 val str4 val str4 val str4

This is how I started playing with it, just to see where things are, but I still can't put the pieces back together:

min_val = "str4"
for i, row in enumerate(table_test.values):
    # every second column starting at index 2 is a quality column
    scores = row[2::2].tolist()
    for score in scores:
        # keep only the part before the first comma
        first_str = score.split(",")[0]
        print(i, first_str)

Thanks for any ideas and/or help!

Best answer

Use boolean indexing to filter with a boolean mask:

min_val = "str4"
df = df[df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val

quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
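To try the line above without the original file, here is a minimal reproduction. It assumes the sample table from the question can be rebuilt by hand as shown; the names `mask` and `filtered` are introduced only for this sketch.

```python
import pandas as pd

# Sample data copied from the question (including the "st63" spelling in quality2).
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4", "ID5"],
    "Sample1": ["val"] * 5,
    "quality1": ["str1,str2,str3@num", "str4,str5,str63@num", "str1,str2,str3@num",
                 "str1,str2,str3@num", "str4,str5,str63@num"],
    "Sample2": ["val"] * 5,
    "quality2": ["str1,str2,str3@num", "str4,str5,st63@num", "str1,str1,str3@num",
                 "str2,str2,str3@num", "str4,str5,st63@num"],
    "Sample3": ["val"] * 5,
    "quality3": ["str1,str2,str3@num", "str4,str5,str63@num", "str4,str2,str3@num",
                 "str1,str2,str3@num", "str4,str5,str63@num"],
})

min_val = "str4"
# One boolean per quality cell, then count the True values in each row.
mask = df.filter(like="quality").apply(lambda col: col.str.startswith(min_val)).sum(axis=1) >= 2
filtered = df[mask]
print(filtered["ID"].tolist())  # ['ID2', 'ID5']
```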

Or, element-wise with applymap:

min_val = "str4"
df = df[df.filter(like='quality').applymap(lambda x: x.startswith(min_val)).sum(axis=1) >= 2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 \
1 ID2 val str4,str5,str63@num val str4,str5,st63@num val
4 ID5 val str4,str5,str63@num val str4,str5,st63@num val

quality3
1 str4,str5,str63@num
4 str4,str5,str63@num
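A note on versions: pandas 2.1 added `DataFrame.map` and deprecated `DataFrame.applymap`. The sketch below is one possible way to write the element-wise variant so that it runs on either side of that change; the shortened sample data and the names `quality`, `hits`, and `result` are assumptions made for illustration.

```python
import pandas as pd

# Shortened, made-up sample with two quality columns.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "quality1": ["str1,str2", "str4,str5", "str1,str2"],
    "quality2": ["str1,str2", "str4,str5", "str4,str2"],
})

min_val = "str4"
quality = df.filter(like="quality")

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
if hasattr(quality, "map"):
    hits = quality.map(lambda x: x.startswith(min_val))
else:
    hits = quality.applymap(lambda x: x.startswith(min_val))

result = df[hits.sum(axis=1) >= 2]
print(result["ID"].tolist())  # ['ID2']
```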

Explanation:

First, filter all columns whose name contains the string quality:

print (df.filter(like='quality'))
quality1 quality2 quality3
0 str1,str2,str3@num str1,str2,str3@num str1,str2,str3@num
1 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num
2 str1,str2,str3@num str1,str1,str3@num str4,str2,str3@num
3 str1,str2,str3@num str2,str2,str3@num str1,str2,str3@num
4 str4,str5,str63@num str4,str5,st63@num str4,str5,str63@num

Then compare every column with startswith to get a boolean DataFrame:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)))
quality1 quality2 quality3
0 False False False
1 True True True
2 False False True
3 False False False
4 True True True
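The sample data has no missing values. If real quality columns do contain NaN, `Series.str.startswith` returns a missing value for those cells by default; passing `na=False` is one way to count them as non-matches. The data below is made up for this sketch.

```python
import numpy as np
import pandas as pd

# Made-up quality columns with some missing cells.
quality = pd.DataFrame({
    "quality1": ["str4,str5", np.nan, "str1,str2"],
    "quality2": ["str4,str5", "str4,str5", np.nan],
})

min_val = "str4"
# na=False makes missing cells count as "does not start with min_val".
hits = quality.apply(lambda col: col.str.startswith(min_val, na=False))
counts = hits.sum(axis=1)
print(counts.tolist())  # [2, 1, 0]
```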

Count the True values with sum; each True is treated as 1:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1))
0 0
1 3
2 1
3 0
4 3
dtype: int64

Compare against the threshold:

print (df.filter(like='quality').apply(lambda x: x.str.startswith(min_val)).sum(axis=1) >=2)
0 False
1 True
2 False
3 False
4 True
dtype: bool
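Since the question asks for a function, the steps above could be wrapped up roughly as follows. The function name `keep_rows_with_min_quality` and its parameters are hypothetical. The sketch compares the text before the first comma for an exact match rather than using startswith, because a prefix test for "str4" would also accept a value such as "str40,...".

```python
import pandas as pd


def keep_rows_with_min_quality(df, min_val, n, like="quality"):
    """Keep rows where at least n matching columns have min_val as their first token."""
    cols = df.filter(like=like)
    # Take the text before the first comma and require an exact match,
    # so "str4" does not also match a value such as "str40,...".
    first_token = cols.apply(lambda col: col.str.split(",").str[0])
    return df[(first_token == min_val).sum(axis=1) >= n]


# Made-up data, including a "str40" value that a prefix test would accept.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3", "ID4"],
    "quality1": ["str1,str2", "str4,str5", "str40,str2", "str4,str5"],
    "quality2": ["str1,str2", "str4,str5", "str40,str2", "str1,str5"],
    "quality3": ["str1,str2", "str4,str5", "str4,str2", "str4,str1"],
})

kept = keep_rows_with_min_quality(df, "str4", n=2)
print(kept["ID"].tolist())  # ['ID2', 'ID4']
```

The function returns a filtered copy and leaves the original columns untouched, so the full quality strings remain available afterwards.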

If you also want to split the quality strings first, split all quality columns and assign the result back:

min_val = "str4"
cols = df.filter(like='quality').columns

df[cols] = df[cols].apply(lambda x: x.str.split(',').str[0])
#another solution
#df[cols] = df[cols].applymap(lambda x: x.split(',')[0])
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
0 ID1 val str1 val str1 val str1
1 ID2 val str4 val str4 val str4
2 ID3 val str1 val str1 val str4
3 ID4 val str1 val str2 val str1
4 ID5 val str4 val str4 val str4

Then compare with min_val to get a boolean DataFrame and filter the same way as before:

df = df[(df[cols] == min_val).sum(axis=1) >=2]
print (df)
ID Sample1 quality1 Sample2 quality2 Sample3 quality3
1 ID2 val str4 val str4 val str4
4 ID5 val str4 val str4 val str4
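The question also mentions computing a percentage across columns instead of a fixed count. One way to sketch that is to take the row-wise mean of the boolean DataFrame, since the mean of booleans is the fraction of True values. The data, the 0.5 threshold, and the names `fraction` and `kept` are assumptions for illustration.

```python
import pandas as pd

# Made-up data where the quality columns already hold only the first token.
df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "quality1": ["str1", "str4", "str1"],
    "quality2": ["str1", "str4", "str4"],
    "quality3": ["str1", "str4", "str4"],
    "quality4": ["str1", "str1", "str1"],
})

min_val = "str4"
cols = df.filter(like="quality").columns
# The mean of a boolean row is the fraction of True values in it.
fraction = (df[cols] == min_val).mean(axis=1)
kept = df[fraction >= 0.5]
print(fraction.tolist())    # [0.0, 0.75, 0.5]
print(kept["ID"].tolist())  # ['ID2', 'ID3']
```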

Source: a similar question on Stack Overflow, "python - pandas df: filtering on partial string values across multiple columns": https://stackoverflow.com/questions/51172597/
