
python - How to edit a text (.fastq) file in Python

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:54:40

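I have a file like the small example below. Every 4 lines belong to one ID. The second line of each ID starts with N. I want to remove the N at the beginning of those lines and leave everything else unchanged. I want to do this in Python. Do you know how to do that?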

Example:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
NGCGACCTCAGATCAGACGTGGCGACC
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
NGCCGACATCGAAGGATCAA
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
NACAAACCCTTGTGTCGAGGGC
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
#=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
NGGGACATGACAGCCTGGACCATCG
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
#=ABBGGGGGGGGGGGGGGGGGGGG

Desired output:

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACC
+SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
#<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAA
+SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
#<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
ACAAACCCTTGTGTCGAGGGC
+SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
#=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
GGGACATGACAGCCTGGACCATCG
+SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
#=ABBGGGGGGGGGGGGGGGGGGGG

Best answer

If I did exactly what you asked (remove the leading N from each sequence), the FASTQ file would be left in an inconsistent state.

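Every fourth line of a FASTQ file holds the quality values for the sequence two lines above it, one quality character per base. So if you remove the first character from the sequence, you also need to remove the first character from the line with the quality values.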
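To make the consistency requirement concrete, here is a minimal sketch of a check that flags records whose sequence and quality lines differ in length. The function name `check_fastq_records` and the short record in the demo are made up for illustration; it assumes plain four-line records with no line wrapping.

```python
def check_fastq_records(lines):
    """Yield (header, problem) for each 4-line record that looks inconsistent.

    Hypothetical helper: assumes every record is exactly four lines
    (header, sequence, "+" line, quality) with no wrapping.
    """
    if len(lines) % 4:
        raise ValueError("FASTQ line count is not a multiple of 4")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            yield header, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield header, "third line does not start with '+'"
        elif len(seq) != len(qual):
            yield header, "sequence and quality lengths differ"


# Made-up record: the leading N was removed from the sequence,
# but the quality line still has its first character.
record = ["@read1", "ACGT", "+", "#IIII"]
for header, problem in check_fastq_records(record):
    print(header, "->", problem)
```

A record trimmed on both lines passes the check and produces no output.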

You can do something very simple in pure Python, like

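    # Odd-indexed lines (0-based) are the sequence and quality lines.
    with open("example.fastq") as f:
        for idx, line in enumerate(f.read().splitlines()):
            if idx % 2:
                # Sequence or quality line: drop the first character.
                print(line[1:])
            else:
                # Header or "+" line: keep as-is.
                print(line)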
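The snippet above trims every sequence and quality line unconditionally and reads the whole file into memory. As a hedged variation, the sketch below works record by record and trims only when the sequence really starts with N, removing the matching quality character at the same time. The names `trim_leading_n` and `trim_file` are hypothetical, and it again assumes unwrapped four-line records.

```python
from itertools import islice


def trim_leading_n(lines):
    """Yield FASTQ lines, dropping a leading 'N' and its quality character.

    Hypothetical helper: `lines` is any iterable of lines (for example an
    open file); each record is assumed to be exactly four lines.
    """
    it = iter(lines)
    while True:
        record = [line.rstrip("\r\n") for line in islice(it, 4)]
        if not record:
            break  # clean end of input
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        header, seq, plus, qual = record
        if seq.startswith("N"):
            # Trim both lines so that the lengths stay equal.
            seq, qual = seq[1:], qual[1:]
        yield from (header, seq, plus, qual)


def trim_file(in_path, out_path):
    """Stream in_path to out_path one record at a time."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in trim_leading_n(src):
            dst.write(line + "\n")


# Made-up records: only the first one starts with N.
demo = ["@r1\n", "NACGT\n", "+r1\n", "#IIII\n",
        "@r2\n", "ACGT\n", "+\n", "IIII\n"]
for line in trim_leading_n(demo):
    print(line)
```

Because the generator consumes its input lazily, `trim_file("example.fastq", "without_starting_N.fastq")` would not need to hold the whole file in memory.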

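However, if you are going to work with biological data regularly, you really should start using a bioinformatics module such as Biopython. It will warn you if you try to do something that would leave the file in an inconsistent shape or that makes no sense.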

The solution looks like this:

    from Bio import SeqIO
    from Bio import Seq

    new_records = []
    for record in SeqIO.parse("example.fastq", "fastq"):
        sequence = str(record.seq)
        letter_annotations = record.letter_annotations

        # You first need to empty the existing letter annotations
        record.letter_annotations = {}

        new_sequence = sequence[1:]
        record.seq = Seq.Seq(new_sequence)

        new_letter_annotations = {'phred_quality': letter_annotations['phred_quality'][1:]}
        record.letter_annotations = new_letter_annotations

        new_records.append(record)

    with open('without_starting_N.fastq', 'w') as output_handle:
        SeqIO.write(new_records, output_handle, "fastq")

which outputs

@SRR2163140.1 HISEQ:148:C670LANXX:3:1101:1302:1947 length=50
GCGACCTCAGATCAGACGTGGCGACC
+
<<ABGGGGGGGGGGGGGGGGGGGGGG
@SRR2163140.3 HISEQ:148:C670LANXX:3:1101:1381:1997 length=50
GCCGACATCGAAGGATCAA
+
<<ABFGGGGGGGGGGGGGG
@SRR2163140.4 HISEQ:148:C670LANXX:3:1101:1705:1940 length=50
ACAAACCCTTGTGTCGAGGGC
+
=ABBGGGGGGGGGGGGGGGGG
@SRR2163140.7 HISEQ:148:C670LANXX:3:1101:1704:1965 length=50
GGGACATGACAGCCTGGACCATCG
+
=ABBGGGGGGGGGGGGGGGGGGGG

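(The "+" character on the third line of each record may optionally be followed by the same sequence identifier and description as the header line two lines above it.)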
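This is why the "+" lines in the output above are bare while the ones in the input repeat the header. As an illustrative sketch (the function name `strip_plus_line_header` is made up), the same normalization can be done in pure Python. It picks the line by its position within the record instead of by its first character, because a quality line may itself begin with "+" or "@".

```python
def strip_plus_line_header(lines):
    """Replace the third line of every 4-line record with a bare '+'.

    Hypothetical helper: assumes unwrapped four-line FASTQ records.
    """
    out = []
    for idx, line in enumerate(lines):
        if idx % 4 == 2:
            if not line.startswith("+"):
                raise ValueError(f"line {idx + 1} should start with '+'")
            out.append("+")
        else:
            out.append(line)
    return out


# Made-up record whose quality line happens to start with '+'.
record = ["@read1 lane=3", "ACGT", "+read1 lane=3", "+III"]
print(strip_plus_line_header(record))
```

Only the third line changes; the quality line keeps its leading "+".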

Regarding "python - How to edit a text (.fastq) file in Python", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43542350/
