
python - Reducing the number of hits in a tabular BLAST output file with Python

Reposted. Author: 行者123. Last updated: 2023-11-30 22:54:38

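I have a large BLAST file in tabular format in which the number of target sequences was not limited, so parsing takes a long time. I want to reduce the number of hits for each query sequence to the top 10. My Python is basic, but this is what I have so far: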

import sys

blastfile = open(sys.argv[1],"r")

column1list=[]

for line in blastfile:
    b = line.split()[0]
    column1list.append(b)

uniqcolumn1 = list(set(column1list))

counter = 0

for val in uniqcolumn1:
    #print val
    for line in blastfile:
        #print line
        while counter <= 10:
            if line.startswith(val):
                print line
                counter =+ 1

Here is an example line from the BLAST output file. The query sequence header is in the first column, in this case "c8208_g1_i2":

c8208_g1_i2 gi|851252702|ref|WP_048131971.1|    79.30   797 165 0   4881    2491    1   797 0.0 1336    acetyl-CoA decarbonylase/synthase complex subunit alpha [Methanosaeta concilii]

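I think the first part of the code works fine up to 'uniqcolumn1 = list(set(column1list))', but after that I cannot get it to print the first ten lines that start with each string in the list.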

Best answer

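The problem seems to be that you are iterating over the file object twice. In Python, a file object behaves much like a pointer that advances as each line is read. Once the first loop has reached the end of the file, there is nothing left to read unless you move the pointer back, so the second loop never runs its body.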

What you need to do is use the .seek method to move this pointer back to the start. For example, suppose you have two files, file_to_read.txt and python_script.py:

file_to_read.txt

Hello! My name is Bob and I can't think of anything to
put in this file so I'm blabbering on about nonsense
in hopes that you won't realise that this text is not
important but the code in the actually file, though I
think that you wouldn't mind reading this long file.

python_script.py

f = open("file_to_read.txt", "r")
for line in f: print line
for line in f: print line

If you run this code (and no directory-related error occurs), file_to_read.txt is printed only once. To fix this, add f.seek(0, 0) between the two reads. For example:

f = open("file_to_read.txt", "r")
for line in f: print line
f.seek(0, 0)
for line in f: print line

Now, back to your context, you can see how this applies to your code:

import sys
# Open the file named on the command line
blastfile = open(sys.argv[1], "r")
column1list = []
# First pass over the file: collect the query ID from column 1
for line in blastfile:
    b = line.split()[0]
    column1list.append(b)
# Move back to the start before the
# next reading
blastfile.seek(0, 0)

uniqcolumn1 = list(set(column1list))

for val in uniqcolumn1:
    # Move the counter inside to refresh it after every iteration
    counter = 0
    # Here is the second time you read your file
    for line in blastfile:
        # NOTE: this while loop is carried over unchanged from the question
        # and is still wrong: it never advances to the next line.
        # The edit below addresses it.
        while counter <= 10:
            if line.startswith(val):
                print line
                counter += 1
    # Since you are going to read the file the next iteration,
    # .seek the file
    blastfile.seek(0, 0)

Edit

Here is the second half of the code, fixed. You can do this:

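for val in uniqcolumn1:
    # Move the counter in
    counter = 0
    # Move the while loop out so that it drives the reading itself
    line = blastfile.readline()
    while line and counter < 10:
        if line.startswith(val):
            print line,
            counter += 1
        line = blastfile.readline()
    # Rewind for the next query ID
    blastfile.seek(0, 0)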

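The advantage of this is that the loop terminates early: once ten hits have been printed for a query, it does not read the rest of the file.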
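The early-termination idea can also be sketched with itertools.islice, which stops pulling lines as soon as the requested number of matches has been produced. This is only an illustration and not part of the original answer: it is written for Python 3 (the snippets above use the Python 2 print statement), the helper name first_hits and the in-memory demo data are made up, and it compares the whole first column rather than using startswith, on the assumption that an exact match on the query ID is what is wanted.

```python
import io
from itertools import islice


def first_hits(handle, query_id, limit=10):
    """Return up to `limit` lines whose first column equals `query_id`."""
    handle.seek(0)  # rewind so every call scans from the top of the file
    matching = (
        line for line in handle
        # split(None, 1)[:1] is [] for a blank line, so blanks never match
        if line.split(None, 1)[:1] == [query_id]
    )
    # islice stops consuming `matching` once `limit` lines have been yielded
    return list(islice(matching, limit))


# Hypothetical stand-in for a tabular BLAST file
demo = io.StringIO(
    "q1\thitA\t99.0\n"
    "q1\thitB\t95.0\n"
    "q1\thitC\t90.0\n"
    "q2\thitD\t88.0\n"
)
top_two = first_hits(demo, "q1", limit=2)
print("".join(top_two), end="")
```

Because the comparison is on the full first column, a query ID such as "q1" does not also pick up lines for a longer ID that merely begins with "q1".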

Another way is to use this:

for val in uniqcolumn1:
    # Move counter in
    counter = 0
    # Remove while statement
    for line in blastfile:
        # Add additional condition to if statement
        # (counter < 10 keeps exactly ten hits; <= 10 would keep eleven)
        if line.startswith(val) and counter < 10:
            print line,
            counter += 1
        elif counter >= 10:
            break
    blastfile.seek(0, 0)

The advantage of this version is that it looks simpler.
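Both versions above still rewind and rescan the file once per query ID, which can be slow when there are many queries. A different technique, not taken from the original answer, is to make a single pass and keep a per-query counter in a dictionary. The sketch below is in Python 3, the function name top_hits_per_query and the demo data are hypothetical, and it assumes the order of lines in the file already reflects the ranking you want, so that "first ten" means "top ten".

```python
import io
from collections import defaultdict


def top_hits_per_query(lines, limit=10):
    """Yield each line until its query ID has been kept `limit` times."""
    kept_so_far = defaultdict(int)  # query ID -> number of lines kept
    for line in lines:
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        query_id = fields[0]
        if kept_so_far[query_id] < limit:
            kept_so_far[query_id] += 1
            yield line


# Hypothetical stand-in for a tabular BLAST file
demo = io.StringIO(
    "q1\thitA\t99.0\n"
    "q1\thitB\t95.0\n"
    "q1\thitC\t90.0\n"
    "q2\thitD\t88.0\n"
    "q1\thitE\t70.0\n"
)
kept = list(top_hits_per_query(demo, limit=2))
print("".join(kept), end="")
```

For a real file, the same generator could be fed an open file handle, for example sys.stdout.writelines(top_hits_per_query(open(sys.argv[1]))). The output keeps the original line order, whereas iterating over a set of query IDs gives no guaranteed order.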

Regarding python - Reducing the number of hits in a tabular BLAST output file with Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37703222/
