gpt4 book ai didi

python - biopython 比对的无间隙索引

转载 作者:太空宇宙 更新时间:2023-11-04 04:58:12 25 4
gpt4 key购买 nike

我第一次使用biopython。如果这是一个基本问题,请原谅我。

我想输入序列,然后对齐它们,并能够引用原始序列(无间隙)和对齐序列(有间隙)的索引位置。

我的现实世界示例是烯醇 enzyme (Uniprot P37869P0A6P9)。结合底物的赖氨酸在大肠杆菌中为 392,在枯草芽孢杆菌中为 389。如果对两者进行肌肉比对,则该赖氨酸在比对中的索引为 394。我希望能够在 gappend 和 ungapped 索引之间轻松转换。

示例 1: 大肠杆菌残基 #392 的比对索引是什么? (对齐序列中的答案 394)。

示例 2 我在 394 处的比对中发现了一个保守残基。它在原始(无间隙)序列中的什么位置? (在大肠杆菌中回答 392,在枯草芽孢杆菌中回答 389)。

>>>cline = MuscleCommandline(input="enolase.txt", out="enolase.aln")
>>>cline()

>>> alignment = AlignIO.read("enolase.aln","fasta")
>>> print(alignment[:,390:])
SingleLetterAlphabet() alignment with 2 rows and 45 columns
AGQIKTGAPSRTDRVAKYNQLLRIEDQLAETAQYHGINSFYNLNK sp|P37869|ENO_BACSU
AGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYNGRKEIKGQA- sp|P0A6P9|ENO_ECOLI
>>> print(alignment[:,394])
KK

最佳答案

有趣的问题!据我所知,BioPython 中没有内置的东西。这是我的解决方法。

让我们从您的示例 2 开始。如果您使用两个文件 enolase.txtenolase.aln 分别使用 FASTA 格式的原始未缺口序列和比对序列,然后我们可以遍历压缩记录,计算比对序列中的空位数并计算未空位序列中残基的索引:

from Bio import SeqIO, AlignIO

def find_in_original(index, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')

for original_record, alignment_record in zip(original, alignment):
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)

gaps = alignment_seq[:index].count('-')
original_index = len(alignment_seq[:index]) - gaps
assert alignment_seq[index] == original_seq[original_index]
yield ("The conserved residue {} at location {} in the alignment can be"
" found at location {} in {}.".format(alignment_seq[index],
index, original_index, original_record.id.split('|')[-1]))

结果如下:

>>> for result in  find_in_original(394, 'enolase.txt', 'enolase.aln'):
... print result
The conserved residue K at location 394 in the alignment can be found at location 392 in ENO_ECOLI.
The conserved residue K at location 394 in the alignment can be found at location 389 in ENO_BACSU.

对于反向操作,我们查看比对中所有可能的索引,如果我们减去间隙,看看哪个等于无间隙序列:

def find_in_alignment(index, organism, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')

for original_record, alignment_record in zip(original, alignment):
if organism in original_record.id:
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)

residue = original_seq[index]
for i in range(index, len(alignment_seq)):
if alignment_seq[i] == residue and \
alignment_seq[:i].replace('-', '') == original_seq[:index]:
return ("The residue {} at location {} in {} is at location"
" {} in the alignment.".format(residue, index,
organism, i))

结果如下:

>>> print find_in_alignment(392, 'ENO_ECOLI', 'enolase.txt', 'enolase.aln')
The residue K at location 392 in ENO_ECOLI is at location 394 in the alignment.
>>> print find_in_alignment(389, 'ENO_BACSU', ungapped_path, alignment_path)
The residue K at location 389 in ENO_BACSU is at location 394 in the alignment.

关于python - biopython 比对的无间隙索引,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/46535260/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com