
python - Counting occurrences of words in a column of a TSV file using Python

Reposted. Author: 行者123. Updated: 2023-11-30 23:26:58

Python beginner question! I have a TSV file that looks like this:

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668
WHI5 YOR083W CDC28 YBR160W physical interactions 21658602
WHI5 YOR083W CDC28 YBR160W physical interactions 24186061
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668
WHI5 YOR083W SWI4 YER111C physical interactions 15210110
WHI5 YOR083W SWI4 YER111C physical interactions 15210111

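I want to count all rows that contain the same word in row[3], and output only the first row of each group, with the number of occurrences in a new column: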

WHI5    YOR083W CDC28   YBR160W physical interactions   19823668    3
WHI5 YOR083W RPD3 YNL330C physical interactions 19823668 1
WHI5 YOR083W SWI4 YER111C physical interactions 15210110 2

So far I have tried combinations of `csv` and `Counter`, or `pandas` and `Counter`, without success...
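For reference, the `csv` plus `Counter` combination can produce the desired output. The following is a minimal sketch, not a definitive implementation: the sample text stands in for the real file, the function name `count_first_rows` is hypothetical, and `key_index=2` (the CDC28/RPD3/SWI4 field) is an assumption about which field to count. For this sample, index 3 would give the same counts.

```python
import csv
from collections import Counter
from io import StringIO

# Stand-in for the real file. With a file on disk, pass
# open("interactions.tsv", newline="") instead (the file name is hypothetical).
TSV_TEXT = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)


def count_first_rows(lines, key_index=2):
    """Return the first row of each group with the group size appended."""
    # Skip blank lines, which csv.reader yields as empty lists.
    rows = [row for row in csv.reader(lines, delimiter="\t") if row]
    counts = Counter(row[key_index] for row in rows)
    seen = set()
    result = []
    for row in rows:
        key = row[key_index]
        if key not in seen:  # keep only the first row per key
            seen.add(key)
            result.append(row + [str(counts[key])])
    return result


result = count_first_rows(StringIO(TSV_TEXT))
for row in result:
    print("\t".join(row))
```

The rows are read into a list first because the counts have to be complete before the first row of each group can be written out.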

Best answer

Using pandas:

>>> import pandas as pd
>>> from io import StringIO  # the literal below is str, so StringIO (BytesIO needs bytes on Python 3)
>>> df = pd.read_table(StringIO("""\
... col1 col2 col3 col4 col5 col6
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 19823668
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 21658602
... WHI5 YOR083W CDC28 YBR160W "physical interactions" 24186061
... WHI5 YOR083W RPD3 YNL330C "physical interactions" 19823668
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210110
... WHI5 YOR083W SWI4 YER111C "physical interactions" 15210111"""),
... sep=r"\s+")  # split on runs of whitespace; delim_whitespace=True is deprecated in recent pandas

The pandas DataFrame will look like this:

>>> df
   col1     col2   col3     col4                   col5      col6
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111

[6 rows x 6 columns]

To get the counts, group by col3 and take the length of each group:

>>> df['cnt'] = df.groupby('col3')['col3'].transform(len)
>>> df
   col1     col2   col3     col4                   col5      col6  cnt
0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668    3
1  WHI5  YOR083W  CDC28  YBR160W  physical interactions  21658602    3
2  WHI5  YOR083W  CDC28  YBR160W  physical interactions  24186061    3
3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668    1
4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110    2
5  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210111    2

[6 rows x 7 columns]
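As a variation on the counting step, pandas also accepts the built-in `"size"` aggregation by name, and the same counts can be obtained by mapping each value to its `value_counts()` entry. This is only a sketch, assuming a reasonably recent pandas. The one-column frame and the `cnt_alt` column name are stand-ins chosen for illustration.

```python
import pandas as pd

# Stand-in for the frame built above: only the grouping column matters here.
df = pd.DataFrame({"col3": ["CDC28", "CDC28", "CDC28", "RPD3", "SWI4", "SWI4"]})

# Built-in "size" aggregation, broadcast back to every row of its group.
df["cnt"] = df.groupby("col3")["col3"].transform("size")

# Equivalent: count each distinct value once, then look the count up per row.
df["cnt_alt"] = df["col3"].map(df["col3"].value_counts())

print(df)
```

Both columns should match the result of `transform(len)`. Passing the aggregation by name lets pandas use its own implementation instead of calling a Python function for each group.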

Select the first row of each group:

>>> df.groupby('col3').apply(lambda obj: obj.head(n=1))
         col1     col2   col3     col4                   col5      col6  cnt
col3
CDC28 0  WHI5  YOR083W  CDC28  YBR160W  physical interactions  19823668    3
RPD3  3  WHI5  YOR083W   RPD3  YNL330C  physical interactions  19823668    1
SWI4  4  WHI5  YOR083W   SWI4  YER111C  physical interactions  15210110    2

[3 rows x 7 columns]
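Putting the steps together for a real tab-separated file without a header row might look like the sketch below. It is not the answer's own code: `drop_duplicates` is swapped in for the `groupby(...).apply(...)` step, the column names are assumptions that mirror the example above, and the inline text stands in for the actual file.

```python
import pandas as pd
from io import StringIO

# Stand-in for the real file. To read from disk, pass a path such as
# "interactions.tsv" (hypothetical name) instead of the StringIO object.
TSV_TEXT = (
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t21658602\n"
    "WHI5\tYOR083W\tCDC28\tYBR160W\tphysical interactions\t24186061\n"
    "WHI5\tYOR083W\tRPD3\tYNL330C\tphysical interactions\t19823668\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210110\n"
    "WHI5\tYOR083W\tSWI4\tYER111C\tphysical interactions\t15210111\n"
)

# Column names are assumptions; the file itself has no header row.
names = ["col1", "col2", "col3", "col4", "col5", "col6"]
df = pd.read_csv(StringIO(TSV_TEXT), sep="\t", header=None, names=names)

# Number of rows sharing each col3 value, repeated on every row of the group.
df["cnt"] = df.groupby("col3")["col3"].transform("size")

# keep="first" is the default, so the earliest row per col3 value survives.
first_rows = df.drop_duplicates(subset="col3")

# With no path given, to_csv returns the TSV text instead of writing a file.
out = first_rows.to_csv(sep="\t", index=False, header=False)
print(out)
```

Because the separator is a real tab, the "physical interactions" field needs no quoting here, unlike the whitespace-delimited example above.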

Regarding "python - Counting occurrences of words in a column of a TSV file using Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22309807/
