gpt4 book ai didi

python - 改进DNA比对去间隙的代码设计

转载 作者:太空狗 更新时间:2023-10-29 21:54:44 24 4
gpt4 key购买 nike

这是一个关于更高效的代码设计的问题:

假设三个对齐的 DNA 序列(seq1、seq2 和 seq3;它们都是字符串)代表两个基因(gene1 和 gene2)。相对于比对的 DNA 序列,这些基因的起始和终止位置是已知的。

# Input
align = {"seq1":"ATGCATGC", # In seq1, gene1 and gene2 are of equal length
"seq2":"AT----GC",
"seq3":"A--CA--C"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}

我希望从比对中移除间隙(即破折号)并保持基因起始和终止位置的相对关联。

# Desired output
align = {"seq1":"ATGCATGC",
"seq2":"ATGC",
"seq3":"ACAC"}
annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,1], "gene2":[2,3]},
"seq3":{"gene1":[0,1], "gene2":[2,3]}}

获得所需的输出并不像看起来那么简单。下面我为这个问题写了一些(行号)伪代码,但肯定有更优雅的设计。

1  measure length of any aligned gene  # take any seq, since all seqs aligned
2 list_lengths = list of gene lengths # order is important
3 for seq in alignment
4 outseq = ""
5 for each num in range(0, length(seq)) # weird for-loop is intentional
6 if seq[num] == "-"
7 current_gene = gene whose start/stop positions include num
8 subtract 1 from length of current_gene
9 subtract 1 from lengths of all genes following current_gene in list_lengths
10 else
11 append seq[num] to outseq
12 append outseq to new variable
13 convert gene lengths into start/stop positions and append ordered to new variable

谁能给我一些更短、更直接的代码设计的建议/示例?

最佳答案

这个答案处理从评论到 cdlanes 答案的更新的 annos 字典。该答案使 annos 字典中的 seq2 gene2 索引不正确 [2,1]。如果序列包含该区域中的所有缺口,我提出的解决方案将从字典中删除 gene 条目。还要注意,如果一个基因在最后的 align 中只包含一个字母,那么 anno[geneX] 将具有相同的开始和停止索引 --> 请参见 seq3 gene1 来自您评论的 annos

align = {"seq1":"ATGCATGC",
"seq2":"AT----GC",
"seq3":"A--CA--C"}

annos = {"seq1":{"gene1":[0,3], "gene2":[4,7]},
"seq2":{"gene1":[0,3], "gene2":[4,7]},
"seq3":{"gene1":[0,3], "gene2":[4,7]}}


annos3 = {"seq1":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq2":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]},
"seq3":{"gene1":[0,2], "gene2":[3,4], "gene3":[5,7]}}

import re
for name,anno in annos.items():
# indices of gaps removed usinig re
removed = [(m.start(0)) for m in re.finditer(r'-', align[name])]

# removes gaps from align dictionary
align[name] = re.sub(r'-', '', align[name])

build_dna = ''
for gene,inds in anno.items():

start_ind = len(build_dna)+1

#generator to sum the num '-' removed from gene
num_gaps = sum(1 for i in removed if i >= inds[0] and i <= inds[1])

# build the de-gapped string
build_dna+= align[name][inds[0]:inds[1]+1].replace("-", "")

end_ind = len(build_dna)

if num_gaps == len(align[name][inds[0]:inds[1]+1]): #gene is all gaps
del annos[name][gene] #remove the gene entry
continue
#update the values in the annos dictionary
annos[name][gene][0] = start_ind-1
annos[name][gene][1] = end_ind-1

结果:

In [3]: annos
Out[3]: {'seq1': {'gene1': [0, 3], 'gene2': [4, 7]},
'seq2': {'gene1': [0, 1], 'gene2': [2, 3]},
'seq3': {'gene1': [0, 1], 'gene2': [2, 3]}}

上述 3 个基因 annos 的结果。只需替换 annos 变量:

In [5]: annos3
Out[5]: {'seq1': {'gene1': [0, 2], 'gene2': [3, 4], 'gene3': [5, 7]},
'seq2': {'gene1': [0, 1], 'gene3': [2, 3]},
'seq3': {'gene1': [0, 0], 'gene2': [1, 2], 'gene3': [3, 3]}}

关于python - 改进DNA比对去间隙的代码设计,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/34816513/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com