python - Save a text string only if it is followed by a different specific text string?

Reposted. Author: 行者123. Updated: 2023-12-01 08:13:22

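Sorry for the poor title; I didn't know how to phrase my question.

I wrote a script that extracts data from a fastq file (a plain-text file of genomic reads). In each four-line record, the first line is the header and the second is the string of bases; the third and fourth lines are not needed.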

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as f_obj:
    file_contents = f_obj.readlines()

extracted_lines = ''
line_count = 0

# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        extracted_lines += line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0

with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
    print(new_filename + " was created.")

The script pulls the header of every read, plus the read's string of bases whenever that string ends in A or G. A sample of the input file:

@HWI-D00461:137:C9H2FACXX:3:1101:1239:1968 1:N:0:GGCTAC
NTGTGTAATAGATTTTACTTTTGCCTTTAAGCCCAAGGTCCTGGACTTGAAACATCCAAGGGATGGAAAATGCCGTATAACAGGGTGGAAGAGAGATTTGA
+
#1=BDDFFHHHFHIJJJJJJJJJJJJJJJJJJJJJIJJIJJJJJHJIIJHGIJJJJJJIHJJBGHJHIIJJJHHHHFFFFEEEDD;?BACDDDA?@CDDDC
@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
+
#1=DDDFDFHHHGHIIGJJJJHIJIHHDIHHIJGGEI@GFGHIHIJHEFHIIIIGIJGHHGECFGIDHGIHIIEGIIJHHEEFFF7?ACEECCBBDEDDDC

The output file looks like this:

@HWI-D00461:137:C9H2FACXX:3:1101:1117:1968 1:N:0:GGCTAC
NAAAGTCTACCAATTATACTTAGTGTGAAGAGGTGGGAGTTAAATATGACTTCCATTAATAGTTTCATTGTTTGGAAAACAGAGGTAATTTTTGATACAGA
@HWI-D00461:137:C9H2FACXX:3:1101:1200:1972 1:N:0:GGCTAC
@HWI-D00461:137:C9H2FACXX:3:1101:1087:1973 1:N:0:GGCTAC
NTAATCCAACTAACTAAAAATAAAAAGATTCAAATAGGTACAGAAAACAATGAAGGTGTAGAGGTGAGAAATCAACAGGATGTTCAGAAGCCTGTGTATGA

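While this contains all the data I need, it pulls every header line (the ones starting with "@"), which is unnecessary.

How can I modify the code so that it pulls a header line only if it is followed by a string of bases ending in A or G?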

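Best answer

The problem is that you add the id for every record, not only for the records you are interested in. A quick fix is to keep the id in a variable and add it only when needed: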

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as f_obj:
    file_contents = f_obj.readlines()

extracted_lines = ''
line_count = 0

# Pull header and base lines
for line in file_contents:
    line_count += 1
    # Headers
    if line_count == 1:
        id_string = line
    # Reads ending in A
    elif line_count == 2 and line[-2] == 'A':
        extracted_lines += id_string
        extracted_lines += line
    # Reads ending in G
    elif line_count == 2 and line[-2] == 'G':
        extracted_lines += id_string
        extracted_lines += line
    # Reset counter
    elif line_count == 4:
        line_count = 0

with open(new_filename, 'w') as f_obj:
    f_obj.write(extracted_lines)
    print(new_filename + " was created.")

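I should also say that this code is not very efficient, especially in memory use: you read a (usually) very large file into memory, yet you only need one record at a time.

A secondary point is that your conditions can be condensed, and you can use the modulo operator to tell which kind of line you are on: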

filename = 'C0_GGCTAC_R1_no_adapter_trimming.fastq'
new_filename = filename[:-9] + '_new.fastq'

with open(filename) as in_f_obj, open(new_filename, 'w') as out_f_obj:
    # Process the file
    line_count = 0
    for line in in_f_obj:
        line_count += 1

        # Extract the information for each record
        if line_count % 4 == 1:
            id_string = line
        elif line_count % 4 == 2:
            seq = line
        elif line_count % 4 == 3:
            extra = line
        elif line_count % 4 == 0:
            # Last part of the record (every fourth line, so the remainder
            # is 0). Here we have all the information
            # and we can decide if we want to output something
            # and what we want to output
            qual = line
            # seq still ends in a newline, so the last base is at index -2
            if seq[-2] == 'A' or seq[-2] == 'G':
                out_f_obj.write(id_string)
                out_f_obj.write(seq)

print(new_filename + " was created.")

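In this code you keep only one record in memory. The line_count variable holds the actual number of lines processed, and you have all the data from the input available, so you can easily change the output later.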
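To illustrate that last point, here is a minimal sketch that writes all four lines of each matching record, so the output stays a valid fastq file. The helper name filter_fastq, its endings parameter, and the in-memory sample data are assumptions made for the example, not part of the original answer; it also assumes strict four-line records.

```python
import io


def filter_fastq(in_f_obj, out_f_obj, endings=("A", "G")):
    """Copy complete 4-line records whose sequence ends in one of `endings`.

    Hypothetical helper for illustration. Returns the number of records written.
    """
    kept = 0
    record = []
    for line in in_f_obj:
        record.append(line)
        if len(record) == 4:
            # record[1] is the sequence line; drop its newline before testing
            seq = record[1].rstrip()
            if seq.endswith(endings):
                out_f_obj.writelines(record)
                kept += 1
            record = []  # start collecting the next record
    return kept


# Made-up two-record input: only the second read ends in A or G
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n")
out = io.StringIO()
kept = filter_fastq(sample, out)
print(kept, "record(s) kept")
print(out.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by the handles from the with open(...) statement above.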

As one extra detail, I would strip the newline from each line as it is read and add it back as needed when writing:

        # Extract the information for each record
        if line_count % 4 == 1:
            id_string = line.rstrip()
        elif line_count % 4 == 2:
            seq = line.rstrip()
        elif line_count % 4 == 3:
            extra = line.rstrip()
        elif line_count % 4 == 0:
            # Last part of the record (every fourth line, so the remainder
            # is 0). Here we have all the information
            # and we can decide if we want to output something
            # and what we want to output
            qual = line.rstrip()
            # The newline is gone, so the last base is now at index -1
            if seq[-1] == 'A' or seq[-1] == 'G':
                out_f_obj.write("{}\n{}\n".format(id_string, seq))

That way your data is clean, free of the newline formatting of the input file.
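The same idea can be taken one step further by separating parsing from filtering. The sketch below is one possible arrangement, not the answer's own code: read_fastq is a hypothetical generator that yields one stripped record at a time, and the sample data is made up. It assumes every record is exactly four lines long.

```python
import io


def read_fastq(f_obj):
    """Yield (id_string, seq, extra, qual) tuples with line endings stripped.

    Hypothetical helper for illustration; assumes strict 4-line records.
    """
    lines = (line.rstrip() for line in f_obj)
    # zip draws four consecutive items from the same iterator for each tuple;
    # an incomplete trailing record is silently dropped.
    yield from zip(lines, lines, lines, lines)


# Made-up input: the first read ends in T, the second in G
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGG\n+\nIIII\n")
out = io.StringIO()
for id_string, seq, extra, qual in read_fastq(sample):
    if seq.endswith(("A", "G")):
        out.write("{}\n{}\n".format(id_string, seq))
print(out.getvalue(), end="")
```

Only one record is held in memory at a time, as before, and the filtering condition no longer needs to know about line counting.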

For "python - Save a text string only if it is followed by a different specific text string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55099426/
