
python - Parsing through .txt files to create a tab-delimited output file

Reposted. Author: 太空宇宙. Updated: 2023-11-03 10:50:57

[MacOS, Python 2.7]

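I am trying to parse .txt files and extract the strings I need to build a tab-delimited table. I will have to do this for many files, but I am having trouble selecting certain strings.

Here is an example input file: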

# Assembly name:  ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na

So far my code looks like this, where Headstring holds the column headers:

# Open the input file for reading
InFile = open(InFileName, 'r')
#f = open(InFileName, 'r')

# Write the header
Headstring= "GenBank_Assembly_ID RefSeq_Assembly_ID Assembly_level Chromosome Plasmid Refseq_chromosome Refseq_plasmid1 Refseq_plasmid2 Refseq_plasmid3 Refseq_plasmid4 Refseq_plasmid5"

# Set up chromosome and plasmid count
ccount = 0
pcount = 0

# Look for corresponding data from each file
with open(InFileName, 'r') as searchfile:
    for line in searchfile:
        if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
            GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
            print GCA.group(1)
            GCA = GCA.group(1)
        if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
            GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
            print GCF.group(1)
            GCF = GCF.group(1)
        if re.search ( r'level: (.+$)', line, re.M|re.I):
            assembly = re.search( r'level: (.+$)', line, re.M|re.I)
            print assembly.group(1)
            assembly = assembly.group(1)
        if "Chromosome" in line:
            ccount += 1
            print ccount
        if "Plasmid" in line:
            pcount += 1
            print pcount

OutputString = "%s\t%s\t%s\t%s\t%s\t" % (GCA, GCF, assembly, ccount, pcount)

OutFile=open(OutFileName, 'w')
OutFile.write(Headstring+'\n'+OutputString)

InFile.close()
OutFile.close()

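The main problem I am having is that I want to extract the strings NC_010611.1, NC_010605.1 and NC_010606.1 and put them on the same line, separated by tabs, so that they fall under the headers Refseq_chromosome, Refseq_plasmid1 and Refseq_plasmid2 respectively. However, I only want the script to search for these when assembly is "Chromosome" or "Complete Genome". I am not sure how to search for the strings only when that condition is true.

I know the regex to get these strings could be =\t(\w+..), but that is as far as I have got.

I am new to Python, so an explanation would be great.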
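For reference, the regex idea mentioned in the question could be sketched roughly as below. This is only an illustration, not part of the original post: the helper name refseq_accession and the pattern are assumptions, and the pattern expects whitespace between the "=" column and the accession. It avoids print statements so it should run on Python 2.7 as well as Python 3.

```python
import re

# Hypothetical pattern: an "=" field, whitespace, then an accession such as
# NC_010605.1 (word characters, a dot, and a version number).
REFSEQ_RE = re.compile(r'=\s+(\w+\.\d+)')

def refseq_accession(line):
    """Return the accession that follows '=' in a data row, or None."""
    match = REFSEQ_RE.search(line)
    return match.group(1) if match else None

# One data row from the sample report, written with tab separators.
row = ("pACICU1\tassembled-molecule\tpACICU1\tPlasmid\tCP000864.1\t=\t"
       "NC_010605.1\tPrimary Assembly\t28279\tna")
accession = refseq_accession(row)

# Header lines such as "# Infraspecific name: strain=ACICU" do not match,
# because no whitespace follows the "=" there.
header_result = refseq_accession("# Infraspecific name: strain=ACICU")
```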

Best answer

Take a look at this example:

import re

InFileName = 'YOUR_INPUT_FILE_NAME'
OutFileName = 'YOUR_OUTPUT_FILE_NAME'

# Write the header
Headstring= "GenBank_Assembly_ID\tRefSeq_Assembly_ID\tAssembly_level\tChromosome\tPlasmid\tRefseq_chromosome\tRefseq_plasmid1\tRefseq_plasmid2\tRefseq_plasmid3\tRefseq_plasmid4\tRefseq_plasmid5"

# Look for corresponding data from each file
with open(InFileName, 'r') as InFile, open(OutFileName, 'w') as OutFile:
    chromosomes = []
    plasmids = []
    for line in InFile:
        if line.lstrip()[0] == '#':
            # Process header part of the file differently from the data part
            if re.search( r': (GCA_[\d\.]+)', line, re.M|re.I):
                GCA = re.search( r': (GCA_[\d\.]+)', line, re.M|re.I)
                print GCA.group(1)
                GCA = GCA.group(1)
            if re.search( r': (GCF_[\d\.]+)', line, re.M|re.I):
                GCF = re.search( r': (GCF_[\d\.]+)', line, re.M|re.I)
                print GCF.group(1)
                GCF = GCF.group(1)
            if re.search ( r'level: (.+$)', line, re.M|re.I):
                assembly = re.search( r'level: (.+$)', line, re.M|re.I)
                print assembly.group(1)
                assembly = assembly.group(1)
        elif assembly in ['Chromosome', 'Complete Genome']:
            # Process each data line separately
            split_line = line.split()
            Type = split_line[3]
            RefSeq_Accn = split_line[6]
            if Type == "Chromosome":
                chromosomes.append(RefSeq_Accn)
            if Type == "Plasmid":
                plasmids.append(RefSeq_Accn)

    # Merge names of up to N chromosomes
    N = 1
    cstr = ''
    for i in range(N):
        if i < len(chromosomes):
            nextChromosome = chromosomes[i]
        else:
            nextChromosome = ''
        cstr += '\t' + nextChromosome

    # Merge names of up to M plasmids
    M = 5
    pstr = ''
    for i in range(M):
        if i < len(plasmids):
            nextPlasmid = plasmids[i]
        else:
            nextPlasmid = ''
        pstr += '\t' + nextPlasmid

    OutputString = "%s\t%s\t%s\t%s\t%s" % (GCA, GCF, assembly, len(chromosomes), len(plasmids))
    OutputString += cstr
    OutputString += pstr

    OutFile.write(Headstring+'\n'+OutputString)

Input:

# Assembly name:  ASM1844v1
# Organism name: Acinetobacter baumannii ACICU (g-proteobacteria)
# Infraspecific name: strain=ACICU
# Taxid: 405416
# BioSample: SAMN02603140
# BioProject: PRJNA17827
# Submitter: CNR - National Research Council
# Date: 2008-4-15
# Assembly type: n/a
# Release type: major
# Assembly level: Complete Genome
# Genome representation: full
# GenBank assembly accession: GCA_000018445.1
# RefSeq assembly accession: GCF_000018445.1
# RefSeq assembly and GenBank assemblies identical: yes
#
## Assembly-Units:
## GenBank Unit Accession RefSeq Unit Accession Assembly-Unit name
## GCA_000018455.1 GCF_000018455.1 Primary Assembly
#
# Ordered by chromosome/plasmid; the chromosomes/plasmids are followed by
# unlocalized scaffolds.
# Unplaced scaffolds are listed at the end.
# RefSeq is equal or derived from GenBank object.
#
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na
pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na
pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na

Output:

GenBank_Assembly_ID  RefSeq_Assembly_ID      Assembly_level  Chromosome  Plasmid Refseq_chromosome  Refseq_plasmid1 Refseq_plasmid2  Refseq_plasmid3 Refseq_plasmid4  Refseq_plasmid5
GCA_000018445.1 GCF_000018445.1 Complete Genome 1 2 NC_010611.1 NC_010605.1 NC_010606.1
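The two padding loops that produce the fixed-width row could also be written more compactly. The sketch below is one possible variant rather than the answer's own code: pad_columns and build_row are hypothetical names, and the defaults of 1 chromosome column and 5 plasmid columns simply mirror N and M in the script. It should behave the same on Python 2.7 and Python 3.

```python
def pad_columns(values, width):
    """Return exactly `width` items: values truncated or padded with ''."""
    return (list(values) + [''] * width)[:width]

def build_row(gca, gcf, level, chromosomes, plasmids, n_chrom=1, n_plasmid=5):
    """Join the summary fields and padded accession columns with tabs."""
    fields = [gca, gcf, level, str(len(chromosomes)), str(len(plasmids))]
    fields += pad_columns(chromosomes, n_chrom)
    fields += pad_columns(plasmids, n_plasmid)
    return '\t'.join(fields)

# Values taken from the sample report in this thread.
row = build_row('GCA_000018445.1', 'GCF_000018445.1', 'Complete Genome',
                ['NC_010611.1'], ['NC_010605.1', 'NC_010606.1'])
```

Padding with empty strings keeps every row at 11 tab-separated fields, so rows from files with fewer plasmids still line up under the header.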

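Main differences from your script:

  • I use the condition if line.lstrip()[0] == '#' to handle the "header" lines (those that start with a hash character) differently from the "table" lines at the bottom (the lines that actually contain the per-sequence data).
  • I use the condition if assembly in ['Chromosome', 'Complete Genome'], which is the condition you specified in your question.
  • I split each table row into values with split_line = line.split(). After that, Type = split_line[3] gives the type (the fourth column of the table data) and RefSeq_Accn = split_line[6] gives the seventh column.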
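The three points above can be condensed into a small function. This is a minimal sketch under a few assumptions, not the answer's exact code: parse_report is a hypothetical name, blank lines and rows with fewer than seven fields are skipped (indexing [0] on an empty string would otherwise raise IndexError), and the assembly level is read with a plain string split instead of a regex. It should run on Python 2.7 and Python 3.

```python
def parse_report(lines):
    """Collect the assembly level and RefSeq accessions from report lines."""
    level = None
    chromosomes, plasmids = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # blank line: nothing to inspect
        if stripped.startswith('#'):
            # Header line: remember the assembly level when we see it.
            if 'Assembly level:' in stripped:
                level = stripped.split(':', 1)[1].strip()
        elif level in ('Chromosome', 'Complete Genome'):
            # Data row: the fourth field is the type, the seventh the accession.
            fields = stripped.split()
            if len(fields) < 7:
                continue  # too short to hold a RefSeq accession
            if fields[3] == 'Chromosome':
                chromosomes.append(fields[6])
            elif fields[3] == 'Plasmid':
                plasmids.append(fields[6])
    return level, chromosomes, plasmids

# A shortened version of the sample report from this thread.
sample = [
    "# Assembly level: Complete Genome",
    "",
    "ANONYMOUS assembled-molecule na Chromosome CP000863.1 = NC_010611.1 Primary Assembly 3904116 na",
    "pACICU1 assembled-molecule pACICU1 Plasmid CP000864.1 = NC_010605.1 Primary Assembly 28279 na",
    "pACICU2 assembled-molecule pACICU2 Plasmid CP000865.1 = NC_010606.1 Primary Assembly 64366 na",
]
level, chromosomes, plasmids = parse_report(sample)
```

Because the function accepts any iterable of lines, an open file object can be passed in directly, which makes it straightforward to call once per input file.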

Regarding python - Parsing through .txt files to create a tab-delimited output file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50330355/
