gpt4 book ai didi

python - 使用 CIGAR 推断序列的长度

转载 作者:行者123 更新时间:2023-11-28 18:28:09 24 4
gpt4 key购买 nike

给你一些上下文:我正在尝试将 sam 文件转换为 bam

samtools view -bT reference.fasta sequences.sam > sequences.bam

退出并出现以下错误

[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1] parse error at line 102
[main_samview] truncated file

违规行如下所示:

SRR808297.2571281       99      gi|309056|gb|L20934.1|MSQMTCG   747     80      101M    =       790     142     TTGGTATAAAATTTAATAATCCCTTATTAATTAATAAACTTCGGCTTCCTATTCGTTCATAAGAAATATTAGCTAAACAAAATAAACCAGAAGAACAT      @@CFDDFD?HFDHIGEGGIEEJIIJJIIJIGIDGIGDCHJJCHIGIJIJIIJJGIGHIGICHIICGAHDGEGGGGACGHHGEEEFDC@=?CACC>CCC      NM:i:2  MD:Z:98A1A

我的序列有 98 个字符长,但创建 sam 文件时可能存在错误,在 CIGAR 中报告为 101。我可以给自己一个奢侈的机会,让自己失去几次阅读,而且我目前无法访问生成 sam 文件的源代码,因此没有机会找出错误并重新运行比对。换句话说,我需要一个务实的解决方案来继续前进(目前)。因此,我设计了一个 python 脚本来计算我的核苷酸串的长度,将其与 CIGAR 中注册的内容进行比较,并将“合理”的行保存在一个新文件中。

#!/usr/bin/python
import itertools
import cigar

with open('myfile.sam', 'r') as f:
for line in itertools.islice(f,3,None): #Loop through the file and skip the first three lines
cigar=line.split("\t")[5]
cigarlength = len(Cigar(cigar)) #Use module Cigar to obtain the length reported in the CIGAR string
seqlength = len(line.split("\t")[9])

if (cigarlength == seqlength):
...Preserve the line in a new file...

如您所见,为了将 CIGAR 转换为显示长度的整数,我使用了模块 CIGAR .老实说,我对它的行为有点警惕。在非常明显的情况下,该模块似乎错误计算了长度。是否有其他模块或更明确的策略将 CIGAR 转换为序列的长度?

旁注:有趣的是,至少可以说,这个问题已被广泛报道,但在互联网上找不到实用的解决方案。请参阅以下链接:

https://github.com/COMBINE-lab/RapMap/issues/9
http://seqanswers.com/forums/showthread.php?t=67253
http://seqanswers.com/forums/showthread.php?t=21120
https://groups.google.com/forum/#!msg/snap-user/FoDsGeNBDE0/nRFq-GhlAQAJ

最佳答案

SAM spec为我们提供了这张 CIGAR 操作表,它指示哪些操作“消耗”了查询或引用,并附有关于如何从 CIGAR 字符串计算序列长度的明确说明:

                                                             Consumes  Consumes
Op BAM Description query reference
M 0 alignment match (can be a sequence match or mismatch) yes yes
I 1 insertion to the reference yes no
D 2 deletion from the reference no yes
N 3 skipped region from the reference no yes
S 4 soft clipping (clipped sequences present in SEQ) yes no
H 5 hard clipping (clipped sequences NOT present in SEQ) no no
P 6 padding (silent deletion from padded reference) no no
= 7 sequence match yes yes
X 8 sequence mismatch yes yes
  • “Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.

...

  • Sum of lengths of the M/I/S/=/X operations shall equal the length of SEQ.

这让我们可以通过将 CIGAR 中所有“消费查询”操作的长度相加,从其 CIGAR 中简单地计算出序列的长度。这正是 cigar 模块中发生的事情(参见 https://github.com/brentp/cigar/blob/754cfed348364d390ec1aa40c951362ca1041f7a/cigar.py#L88-L93 ),所以我不知道为什么这里的 OP 认为该模块的实现是错误的。

如果我们从(已经很短的)雪茄模块中提取相关代码,我们将得到类似上面引述中描述的求和操作的简短 Python 实现:

from itertools import groupby

def query_len(cigar_string):
"""
Given a CIGAR string, return the number of bases consumed from the
query sequence.
"""
read_consuming_ops = ("M", "I", "S", "=", "X")
result = 0
cig_iter = groupby(cigar_string, lambda chr: chr.isdigit())
for _, length_digits in cig_iter:
length = int(''.join(length_digits))
op = next(next(cig_iter)[1])
if op in read_consuming_ops:
result += length
return result

关于python - 使用 CIGAR 推断序列的长度,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39710796/

24 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com