python - Compare tables to create a presence/absence matrix, filling empty values without decimals

Reposted. Author: 行者123. Updated: 2023-12-01 02:06:07
The input files can be found on GitHub.

File 1:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file1.txt

File 2:

https://raw.githubusercontent.com/felipelira/files_to_test/master/file2.txt

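Command line: python teste2.py file1.txt file2.txt test

When converting the table files into a presence/absence matrix, I end up losing some data: genomes whose genes match no accession do not appear in the output.

My previous result looked like this (based on the script and example in the post "Convert tables to presence/absence matrix python - Solved"):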

genome  accession1  accession2  accession3  accession4  accession5
genome1 1 1 1 0 0
genome2 1 0 0 1 1

But I need the other genomes as well for a prospective analysis. I tried moving the block that defines df2 before the one that defines df1:

import sys

import pandas as pd

asmbly_dict = sys.argv[1]       # file1: gene -> genome
blast_result = sys.argv[2]      # file2: gene -> accession
outName = sys.argv[3] + '.txt'

# Read the gene -> accession table first so it can be mapped onto df1.
with open(blast_result, 'r') as file2:
    col_genes = ['gene', 'accession']
    df2 = pd.read_csv(file2, sep='\t', header=None, names=col_genes)
    print(df2)

with open(asmbly_dict, 'r') as file1:
    col_asmbly = ['gene', 'genome']
    df1 = pd.read_csv(file1, sep='\t', header=None, names=col_asmbly)
    # Genes with no entry in df2 get NaN as their accession.
    df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])
    #print(df1)
    g = df1.groupby('genome')['accession'].apply(list).reset_index()
    testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)
    #print(testdf.to_string(index=False))
    testdf.to_csv(outName, sep='\t', header=True, index=False)

Printing df2:

    gene   accession
0 gene1 accession1
1 gene2 accession2
2 gene3 accession3
3 gene4 accession1
4 gene5 accession4
5 gene6 accession5

Printing df1:

    gene   genome   accession
0 gene1 genome1 accession1
1 gene2 genome1 accession2
2 gene3 genome1 accession3
3 gene4 genome2 accession1
4 gene5 genome2 accession4
5 gene6 genome2 accession5
6 gene7 genome3 NaN
7 gene8 genome3 NaN
8 gene9 genome4 NaN

Printing testdf:

genome  accession1  accession2  accession3  accession4  accession5
genome1 1.0 1.0 1.0 0.0 0.0
genome2 1.0 0.0 0.0 1.0 1.0
genome3 NaN NaN NaN NaN NaN
genome4 NaN NaN NaN NaN NaN

And the written output file:

genome  accession1  accession2  accession3  accession4  accession5
genome1 1.0 1.0 1.0 0.0 0.0
genome2 1.0 0.0 0.0 1.0 1.0
genome3
genome4

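The questions are:

How can I drop the decimal point after the numbers (1.0 -> 1), and how can I fill the empty values with zeros, both when printing and when writing to the file?

Best answer

If you want to keep the original solution, add fillna and cast to int: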

testdf = g.join(pd.get_dummies(g['accession'].apply(pd.Series).stack()).sum(level=0)).drop('accession', 1)

# Keep the string column 'genome' out of the integer cast.
testdf = testdf.set_index('genome').fillna(0).astype(int).reset_index()
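
The decimals come from the NaN values: a column that holds NaN is stored as float, so 1 prints as 1.0 until the gaps are filled and the column is cast back. One detail to watch is that `astype(int)` would also try to convert the string `genome` column, so the sketch below moves it into the index before casting. The frame `demo` is a made-up stand-in for the question's `testdf`, not data read from the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the shape of the question's testdf.
demo = pd.DataFrame({
    "genome": ["genome1", "genome2", "genome3"],
    "accession1": [1.0, 1.0, np.nan],
    "accession2": [1.0, 0.0, np.nan],
})

# Move the label column into the index, fill the gaps, cast, and restore it.
fixed = demo.set_index("genome").fillna(0).astype(int).reset_index()
print(fixed.to_string(index=False))
```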

But a better solution is to use get_dummies and then take the max per index value and per column (not needed for the sample, but possibly for real data):

df1['accession'] = df1['gene'].map(df2.set_index('gene')['accession'])

df1 = pd.get_dummies(df1.set_index('genome')['accession']).max(level=0).max(level=0, axis=1)
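
On newer pandas releases the `level=` argument of `max` may no longer be accepted, and `get_dummies` may return booleans instead of 0/1. A minimal sketch of the same idea, assuming such a release, uses `groupby(level=0)` and `dtype=int`. The two frames are typed in from the tables printed in the question rather than read from the real files.

```python
import pandas as pd

# Sample data copied from the question's printed df2 and df1.
df2 = pd.DataFrame({
    "gene": ["gene1", "gene2", "gene3", "gene4", "gene5", "gene6"],
    "accession": ["accession1", "accession2", "accession3",
                  "accession1", "accession4", "accession5"],
})
df1 = pd.DataFrame({
    "gene": ["gene%d" % i for i in range(1, 10)],
    "genome": ["genome1"] * 3 + ["genome2"] * 3 + ["genome3"] * 2 + ["genome4"],
})

# Genes without a hit get NaN, which get_dummies turns into an all-zero row.
df1["accession"] = df1["gene"].map(df2.set_index("gene")["accession"])
dummies = pd.get_dummies(df1.set_index("genome")["accession"], dtype=int)

# One row per genome: 1 if any of its genes matched the accession.
matrix = dummies.groupby(level=0).max()
print(matrix)
```

Because unmatched genes still contribute a row of zeros, genome3 and genome4 stay in the result without any extra reindexing.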

Or use crosstab and clip_upper, then add the missing categories with reindex:

df1 = (pd.crosstab(df1['genome'], df1['accession'])
         .clip_upper(1)
         .reindex(df1['genome'].unique(), fill_value=0))
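
Each step does one job: crosstab counts the genome/accession pairs, the clip turns counts above 1 into 1, and reindex brings back the genomes that crosstab dropped because all their accessions were missing. The sketch below assumes a pandas version where `clip_upper` is unavailable and uses `clip(upper=1)` in its place. The small table is invented for illustration.

```python
import pandas as pd

# Hypothetical long-format table; None marks a gene without a matching accession.
df1 = pd.DataFrame({
    "genome": ["genome1", "genome1", "genome2", "genome2", "genome3"],
    "accession": ["accession1", "accession1", "accession1", "accession4", None],
})

counts = pd.crosstab(df1["genome"], df1["accession"])  # genome3 is dropped here
matrix = (counts
          .clip(upper=1)                                   # counts -> presence/absence
          .reindex(df1["genome"].unique(), fill_value=0))  # restore genomes with no hits
print(matrix)
```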

Or:

df1 = (df1.groupby(['genome', 'accession'])
          .size()
          .clip_upper(1)
          .unstack(fill_value=0)
          .reindex(df1['genome'].unique(), fill_value=0))
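
The groupby variant can be checked against the question's sample in the same way. This is a sketch assuming a pandas version without `clip_upper`, so `clip(upper=1)` stands in for it. The rows are typed in from the printed df1, with None for the unmatched genes.

```python
import pandas as pd

# Rows copied from the question's printed df1; None stands for NaN.
df1 = pd.DataFrame({
    "genome": ["genome1"] * 3 + ["genome2"] * 3 + ["genome3"] * 2 + ["genome4"],
    "accession": ["accession1", "accession2", "accession3",
                  "accession1", "accession4", "accession5",
                  None, None, None],
})

matrix = (df1.groupby(["genome", "accession"])
             .size()                    # pair counts; rows with a missing accession are skipped
             .clip(upper=1)             # counts -> presence/absence
             .unstack(fill_value=0)     # accessions become columns
             .reindex(df1["genome"].unique(), fill_value=0))
print(matrix)
```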
print (df1)
accession1 accession2 accession3 accession4 accession5
genome
genome1 1 1 1 0 0
genome2 1 0 0 1 1
genome3 0 0 0 0 0
genome4 0 0 0 0 0

Finally, write to the file:

df1.to_csv(outName, sep='\t')
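
Since genome is now the index, to_csv writes it as the first column without needing `index=False`. A quick way to confirm that no decimals reach the output is to write to an in-memory buffer first. In this sketch the frame `matrix` and the buffer are illustrative stand-ins for `df1` and the real output file.

```python
import io

import pandas as pd

# Hypothetical integer matrix with genome as a named index.
matrix = pd.DataFrame(
    {"accession1": [1, 0], "accession2": [0, 0]},
    index=pd.Index(["genome1", "genome3"], name="genome"),
)

buffer = io.StringIO()            # stands in for the output file
matrix.to_csv(buffer, sep="\t")   # the index name becomes the first header cell
text = buffer.getvalue()
print(text)
```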

Regarding "python - Compare tables to create a presence/absence matrix, filling empty values without decimals", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/49050710/