python - How to count the rows that follow a specific value in the same DataFrame column

Repost. Author: 行者123. Updated: 2023-12-04 08:18:20

Suppose I have the following DataFrame:

import pandas as pd

tempDic = {0: {0: 'class([1,0,0,0],"Small-molecule metabolism ").', 1: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 2: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 3: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 4: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 5: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 6: 'class([1,1,0,0],"Degradation ").', 7: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 8: 'function(tb536,[1,1,1,0],\'galE2\',"UDP-glucose 4-epimerase").', 9: 'function(tb620,[1,1,1,0],\'galK\',"galactokinase").', 10: 'function(tb619,[1,1,1,0],\'galT\',"galactose-1-phosphate uridylyltransferase C-term").', 11: 'class([1,1,1,0],"Carbon compounds ").', 12: 'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").', 13: 'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").', 14: 'function(tb727,[1,1,1,0],\'fucA\',"L-fuculose phosphate aldolase").', 15: 'function(tb1731,[1,1,1,0],\'gabD1\',"succinate-semialdehyde dehydrogenase").', 16: 'function(tb234,[1,1,1,0],\'gabD2\',"succinate-semialdehyde dehydrogenase").', 17: 'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").', 18: 'class([1,1,1,0],"xyz ").'}}

df = pd.DataFrame(tempDic)
print(df)



    0
0   class([1,0,0,0],"Small-molecule metabolism ").
1   function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
2   function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
3   function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
4   function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
5   function(tb234,[1,1,1,0],'gabD2',"succinate-se...
6   class([1,1,0,0],"Degradation ").
7   function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
8   function(tb536,[1,1,1,0],'galE2',"UDP-glucose ...
9   function(tb620,[1,1,1,0],'galK',"galactokinase").
10  function(tb619,[1,1,1,0],'galT',"galactose-1-p...
11  class([1,1,1,0],"Carbon compounds ").
12  function(tb186,[1,1,1,0],'bglS',"beta-glucosid...
13  function(tb2202,[1,1,1,0],'cbhK',"carbohydrate...
14  function(tb727,[1,1,1,0],'fucA',"L-fuculose ph...
15  function(tb1731,[1,1,1,0],'gabD1',"succinate-s...
16  function(tb234,[1,1,1,0],'gabD2',"succinate-se...
17  function(tb501,[1,1,1,0],'galE1',"UDP-glucose ...
18  class([1,1,1,0],"xyz ").
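
What I need is an approach that gives me a result like this:

Class                      Count
Small-molecule metabolism  5
Degradation                4
Carbon compounds           6
xyz                        0

Every row that starts with "class" contains the class name in double quotes, for example "Small-molecule metabolism" in the first row. It is followed by rows that start with "function". We only need to count the rows that start with "function" and put that count next to the class name.
A class with no "function" rows after it should get 0, meaning the class has zero functions.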

Best answer

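Build a mask with Series.str.startswith, extract the value between the double quotes with Series.str.extract, forward-fill the missing values, then use GroupBy.size and subtract 1 (each group's size includes its own "class" row):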

df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)

df['Class'] = df['Class'].ffill()

s = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print(s)
                       Class  Count
0  Small-molecule metabolism      5
1                Degradation      4
2           Carbon compounds      6
3                        xyz      0
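
One limitation worth noting: grouping by the forward-filled class name merges classes that happen to share a name (the sample data has no duplicates, so it does not show up there). The following is a minimal sketch of a variation, not part of the original answer, that gives each block its own id with a cumulative sum over the "class" mask and counts only the "function" rows, so no subtraction is needed. The `sample` data and the names `block`, `in_block`, and `out` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample in the question's format, with the class name "A"
# appearing twice; grouping by name alone would merge those two blocks.
sample = pd.DataFrame({0: [
    'class([1,0,0,0],"A ").',
    'function(tb1,[1,1,1,0],\'x1\',"one").',
    'function(tb2,[1,1,1,0],\'x2\',"two").',
    'class([1,1,0,0],"B ").',
    'class([1,1,1,0],"A ").',
    'function(tb3,[1,1,1,0],\'x3\',"three").',
]})

is_class = sample[0].str.startswith('class')
is_function = sample[0].str.startswith('function')

# Every "class" row starts a new block; cumsum labels the blocks 1, 2, 3, ...
block = is_class.cumsum()

# Rows before the first "class" row (block 0) belong to no class; skip them.
in_block = block > 0

# One name per "class" row, with the trailing space inside the quotes removed.
names = sample.loc[is_class, 0].str.extract(r'"(.+)"', expand=False).str.strip()

# Count the "function" rows inside each block.
counts = is_function[in_block].astype(int).groupby(block[in_block]).sum()

out = pd.DataFrame({'Class': names.to_numpy(), 'Count': counts.to_numpy()})
print(out)
```

With this input the two "A" blocks stay separate, with counts 2 and 1, and "B" gets 0.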

Step by step:
print(df.loc[df[0].str.startswith('class'), 0])
0 class([1,0,0,0],"Small-molecule metabolism ").
6 class([1,1,0,0],"Degradation ").
11 class([1,1,1,0],"Carbon compounds ").
18 class([1,1,1,0],"xyz ").
Name: 0, dtype: object

print (df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False))
0 Small-molecule metabolism
6 Degradation
11 Carbon compounds
18 xyz
Name: 0, dtype: object
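
A small sketch of why the mask comes first and what `expand=False` changes, assuming nothing beyond pandas itself: the "function" rows also contain a quoted string, so applying the pattern to the whole column would extract their descriptions too. The two-row Series `rows` is illustrative only.

```python
import pandas as pd

# Two illustrative rows: one "class" row and one "function" row.
rows = pd.Series([
    'class([1,0,0,0],"Small-molecule metabolism ").',
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").',
])

# expand=False with a single capture group returns a Series,
# expand=True returns a one-column DataFrame.
as_series = rows.str.extract(r'"(.+)"', expand=False)
as_frame = rows.str.extract(r'"(.+)"', expand=True)
print(type(as_series).__name__, type(as_frame).__name__)

# Without a mask, the "function" row matches the pattern as well.
print(as_series.str.strip().tolist())

# With the mask, only the "class" row is passed to the pattern.
mask = rows.str.startswith('class')
only_classes = rows[mask].str.extract(r'"(.+)"', expand=False).str.strip()
print(only_classes.tolist())
```

The `.str.strip()` call is optional; it removes the trailing space that the sample data keeps inside the quotes.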
df['Class'] = df.loc[df[0].str.startswith('class'), 0].str.extract('"(.+)"', expand=False)
print (df['Class'])
0 Small-molecule metabolism
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 Degradation
7 NaN
8 NaN
9 NaN
10 NaN
11 Carbon compounds
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 xyz
Name: Class, dtype: object
df['Class'] = df['Class'].ffill()
print (df['Class'])
0 Small-molecule metabolism
1 Small-molecule metabolism
2 Small-molecule metabolism
3 Small-molecule metabolism
4 Small-molecule metabolism
5 Small-molecule metabolism
6 Degradation
7 Degradation
8 Degradation
9 Degradation
10 Degradation
11 Carbon compounds
12 Carbon compounds
13 Carbon compounds
14 Carbon compounds
15 Carbon compounds
16 Carbon compounds
17 Carbon compounds
18 xyz
Name: Class, dtype: object
print (df.groupby('Class', sort=False).size())
Class
Small-molecule metabolism 6
Degradation 5
Carbon compounds 7
xyz 1
dtype: int64

df1 = df.groupby('Class', sort=False).size().sub(1).reset_index(name='Count')
print(df1)
                       Class  Count
0  Small-molecule metabolism      5
1                Degradation      4
2           Carbon compounds      6
3                        xyz      0
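
For comparison, the same counting can be sketched without pandas as a single pass over the rows. This is not from the original answer; the helper name `count_functions` and the shortened `lines` list are hypothetical, and the sketch assumes every class name sits between the first and last double quote of its row.

```python
# Hypothetical, shortened sample in the same format as the question's data.
lines = [
    'class([1,0,0,0],"Small-molecule metabolism ").',
    'function(tb186,[1,1,1,0],\'bglS\',"beta-glucosidase").',
    'function(tb2202,[1,1,1,0],\'cbhK\',"carbohydrate kinase").',
    'class([1,1,0,0],"Degradation ").',
    'function(tb501,[1,1,1,0],\'galE1\',"UDP-glucose 4-epimerase").',
    'class([1,1,1,0],"xyz ").',
]


def count_functions(rows):
    """Return [class_name, count] pairs in order of appearance."""
    counts = []
    for row in rows:
        if row.startswith('class'):
            # Take the text between the first and last double quote and
            # drop the trailing space the data keeps inside the quotes.
            name = row[row.index('"') + 1:row.rindex('"')].strip()
            counts.append([name, 0])
        elif row.startswith('function') and counts:
            # Credit the row to the most recent class seen so far.
            counts[-1][1] += 1
    return counts


result = count_functions(lines)
print(result)
```

A class with no "function" rows keeps its initial count of 0, which matches the requirement in the question.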

Regarding python - how to count the rows that follow a specific value in the same DataFrame column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/65608095/