
python - Replacing sequences between files using Biopython

Reposted. Author: 太空宇宙. Updated: 2023-11-03 17:23:09

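I have two FASTA files of protein sequences:

nsp.fasta --> the original file

wsp.fasta --> the output file from a signal peptide prediction tool, which returns the proteins in nsp.fasta with the signal removed.

For example:

Record in nsp.fasta: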

>gi|564250271|ref|XP_006264203.1| PREDICTED: apolipoprotein D [Alligator mississippiensis]
MRGMLALLAALLGLLGLVEGQTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQERCVQANYSLKANGKIKVLTKMVRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYALVYSCTSFFWLFHVDYAWIRSRTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM

record in wsp.fasta:

>gi|564250271|ref|XP|006264203.1|  PREDICTED: apolipoprotein D [Alligator mississippiensis]; MatureChain: 21-179
QTFHMGQCPNPPVQEDFDPSKYLGKWYEIEKLPSGFEQERCVQANYSLKANGKIKVLTKMVRSAQHLTCLQHRMMLLVSSPVMPASPYWVVATDYENYALVYSCTSFFWLFHVDYAWIRSRTPQLHPETVEHLKSVLRSYRIQTGMMLPTDQMNCPSDM

However, not all the proteins in nsp.fasta contain a signal peptide, so wsp.fasta holds only the subset of proteins from nsp.fasta in which a signal was found. What I need is a single file that contains all the protein records: the proteins with no signal peptide found, unchanged, and the mature chains of the others with the signal peptide stripped.

I have tried the following:

from Bio import SeqIO

file1 = SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")

file2 = SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")

for seq1 in file1:
    for seq2 in file2:
        if seq2.id == seq1.id:
            seq1.seq = seq2.seq
            SeqIO.write(seq1, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")

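But there is no output at all. I tried moving SeqIO.write outside the loop, and it produces a blank file. What am I doing wrong? Is there an existing method for merging two files, or for replacing the sequences in one file with those from another?

Thank you in advance!!

Sergio

Edited code: I added an elif clause to try to include the records in nsp.fasta that have no match in wsp.fasta, but it does not work: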

to_write = []

for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.txt", "fasta"):
    for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.txt", "fasta"):
        if seq1.id == seq2.id:
            seq1.seq = seq2.seq
            to_write.append(seq1)
        elif seq1.id != seq2.id:
            to_write.append(seq1)

SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.txt", "fasta")

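Best answer

As written, each time you write a new sequence you overwrite the previous one. Try storing the records in a list and writing that list out once the loop has finished.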

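from Bio import SeqIO

to_write = []
for seq1 in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta"):
    # Re-open wsp.fasta on every pass so the inner loop sees all of its records
    for seq2 in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta"):
        if seq2.id == seq1.id:
            # Swap in the mature chain; only matched records are collected here
            seq1.seq = seq2.seq
            to_write.append(seq1)
# Write the whole list in a single call so earlier records are not overwritten
SeqIO.write(to_write, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")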
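A further detail that may explain the blank file in the question: SeqIO.parse returns an iterator that can be consumed only once. In the question's first attempt, file2 is created once outside the loops, so the inner loop has nothing left to read after the first outer pass. The sketch below reproduces that behaviour with a plain generator; fake_parse and the ids are hypothetical stand-ins, not part of Biopython.

```python
def fake_parse(ids):
    """Hypothetical stand-in for SeqIO.parse: yields each id once, then is exhausted."""
    for record_id in ids:
        yield record_id


nsp_ids = ["A", "B", "C"]  # made-up ids standing in for nsp.fasta
wsp_ids = ["B", "C"]       # made-up ids standing in for wsp.fasta

# Inner iterator created once, outside the loops (as in the question).
# It is used up while the outer loop is still on "A", so "B" and "C" never match.
inner = fake_parse(wsp_ids)
matches_reused = [a for a in fake_parse(nsp_ids) for b in inner if a == b]

# Inner iterator re-created on every outer pass (as in the answer above).
matches_fresh = [a for a in fake_parse(nsp_ids) for b in fake_parse(wsp_ids) if a == b]

print(matches_reused)
print(matches_fresh)
```

Calling SeqIO.parse inside the outer loop, as the answer's code does, avoids this at the cost of re-reading wsp.fasta once per record in nsp.fasta.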

Edited to suggest another approach using list comprehensions:

ids_to_save = [x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta")]
records_to_save = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta") if (x.id in ids_to_save)]
SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")

Edited to address the need to "add the records in nsp.fasta that have no match in wsp.fasta" (a general approach, not necessarily exact code):

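# Ids that already have a mature-chain record in wsp.fasta
ids_not_wanted = {x.id for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\wsp.fasta", "fasta")}
# Records from nsp.fasta (not wsp.fasta) that have no counterpart in wsp.fasta
records_to_save_2 = [x for x in SeqIO.parse(r"c:\Users\Sergio\Desktop\nsp.fasta", "fasta") if (x.id not in ids_not_wanted)]

# extend adds the records one by one; append would nest the whole list as a single item
records_to_save.extend(records_to_save_2)
# If duplicate records are a problem, eliminate them by id
# (SeqRecord objects are not hashable, so "set" cannot be applied to them directly)
records_to_save = list({x.id: x for x in records_to_save}.values())
SeqIO.write(records_to_save, r"c:\Users\Sergio\Desktop\nuevsp.fasta", "fasta")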
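The same merge can also be sketched in a single pass with a dictionary lookup, which reads each file only once. This is a minimal sketch rather than the answer's method: merge_records is a hypothetical helper, it assumes the record ids are identical in both files, and the demonstration uses SimpleNamespace objects with invented ids and sequences as stand-ins for Biopython SeqRecord objects.

```python
from types import SimpleNamespace


def merge_records(original_records, mature_records):
    """Return every original record, with .seq replaced by the mature sequence when one exists.

    Works on any objects exposing .id and .seq (for example Biopython SeqRecord objects).
    """
    # Map each id in the mature-chain file to its trimmed sequence.
    mature_by_id = {rec.id: rec.seq for rec in mature_records}
    merged = []
    for rec in original_records:
        if rec.id in mature_by_id:
            rec.seq = mature_by_id[rec.id]  # signal peptide stripped
        merged.append(rec)                  # unmatched records are kept unchanged
    return merged


# Demonstration with stand-in records (invented ids and sequences).
nsp = [SimpleNamespace(id="p1", seq="MKLLQTF"), SimpleNamespace(id="p2", seq="MADE")]
wsp = [SimpleNamespace(id="p1", seq="QTF")]
merged = merge_records(nsp, wsp)
print([(r.id, r.seq) for r in merged])
```

With Biopython, the two arguments would be SeqIO.parse(..., "fasta") iterators over nsp.fasta and wsp.fasta, and the returned list can be passed straight to SeqIO.write. Note that the match is on the full record id: if the ids really differ between the files, as the XP_006264203.1 and XP|006264203.1 headers in the example records suggest they might, no record would be replaced and the ids would need normalizing first.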

Regarding python - Replacing sequences between files using Biopython, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32911990/
