python - BCBio's GFF parser does not parse correctly

Reposted. Author: 行者123. Last updated: 2023-11-28 18:45:58

I am experimenting with BCBio's GFF parser in the hope of using it in my tool. I took a test .gbk file from NCBI's RefSeq database and converted it to a .gff file.

The code I used (from http://biopython.org/wiki/GFF_Parsing ):

#!/usr/bin/python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "/var/www/localhost/NC_009925.gbk"
    out_file = "/var/www/localhost/output/your_file.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    # Parse the GenBank records and write them out as GFF3
    GFF.write(SeqIO.parse(in_handle, "genbank"), out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()

Part of the result looks like this:

##gff-version 3
##sequence-region NC_009925.1 1 6503724
NC_009925.1 annotation remark 1 6503724 . . . accessions=NC_009925;comment=PROVISIONAL REFSEQ: This record has not yet been subject to final%0ANCBI review. The reference sequence was derived from CP000828.%0ASource bacteria from Marine Biotechnology Institute Culture%0ACollection%2C Marine Biotechnology Institute%2C 3-75-1 Heita%2C Kamaishi%2C%0AIwate 026-0001%2C Japan.%0ACOMPLETENESS: full length.;data_file_division=CON;date=10-JUN-2013;gi=158333233;keywords=;organism=Acaryochloris marina MBIC11017;references=location: %5B0:6503724%5D%0Aauthors: Swingley%2CW.D.%2C Chen%2CM.%2C Cheung%2CP.C.%2C Conrad%2CA.L.%2C Dejesa%2CL.C.%2C Hao%2CJ.%2C Honchak%2CB.M.%2C Karbach%2CL.E.%2C Kurdoglu%2CA.%2C Lahiri%2CS.%2C Mastrian%2CS.D.%2C Miyashita%2CH.%2C Page%2CL.%2C Ramakrishna%2CP.%2C Satoh%2CS.%2C Sattley%2CW.M.%2C Shimada%2CY.%2C Taylor%2CH.L.%2C Tomo%2CT.%2C Tsuchiya%2CT.%2C Wang%2CZ.T.%2C Raymond%2CJ.%2C Mimuro%2CM.%2C Blankenship%2CR.E. and Touchman%2CJ.W.%0Atitle: Niche adaptation and genome expansion in the chlorophyll d-producing cyanobacterium Acaryochloris marina%0Ajournal: Proc. Natl. Acad. Sci. U.S.A. 105 %286%29%2C 2005-2010 %282008%29%0Amedline id: %0Apubmed id: 18252824%0Acomment:,location: %5B0:6503724%5D%0Aauthors: %0Aconsrtm: NCBI Genome Project%0Atitle: Direct Submission%0Ajournal: Submitted %2817-OCT-2007%29 National Center for Biotechnology Information%2C NIH%2C Bethesda%2C MD 20894%2C USA%0Amedline id: %0Apubmed id: %0Acomment:,location: %5B0:6503724%5D%0Aauthors: Touchman%2CJ.W.%0Atitle: Direct Submission%0Ajournal: Submitted %2827-AUG-2007%29 Pharmaceutical Genomics Division%2C Translational Genomics Research Institute%2C 13208 E Shea Blvd%2C Scottsdale%2C AZ 85004%2C USA%0Amedline id: %0Apubmed id: %0Acomment:;sequence_version=1;source=Acaryochloris marina MBIC11017;taxonomy=Bacteria,Cyanobacteria,Oscillatoriophycideae,Chroococcales,Acaryochloris
NC_009925.1 feature source 1 6503724 . + . db_xref=taxon:329726;mol_type=genomic DNA;note=type strain of Acaryochloris marina;organism=Acaryochloris marina MBIC11017;strain=MBIC11017
NC_009925.1 feature gene 931 1581 . - . db_xref=GeneID:5685235;locus_tag=AM1_0001;note=conserved hypothetical protein;pseudo=
NC_009925.1 feature gene 1627 2319 . - . db_xref=GeneID:5678840;locus_tag=AM1_0003

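The problem is with the third and fourth lines: the writer takes the complete header information from the .gbk file and emits it as a single line, whereas it should skip it. The last two lines are correct (as is the rest of the output file). I have tried several different .gbk files, and they all give the same result.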

For reference, here is the beginning of the .gbk file:

LOCUS       NC_009925            6503724 bp    DNA     circular CON 10-JUN-2013
DEFINITION Acaryochloris marina MBIC11017 chromosome, complete genome.
ACCESSION NC_009925
VERSION NC_009925.1 GI:158333233
DBLINK Project: 58167
BioProject: PRJNA58167
KEYWORDS .
SOURCE Acaryochloris marina MBIC11017
ORGANISM Acaryochloris marina MBIC11017
Bacteria; Cyanobacteria; Oscillatoriophycideae; Chroococcales;
Acaryochloris.
REFERENCE 1 (bases 1 to 6503724)
AUTHORS Swingley,W.D., Chen,M., Cheung,P.C., Conrad,A.L., Dejesa,L.C.,
Hao,J., Honchak,B.M., Karbach,L.E., Kurdoglu,A., Lahiri,S.,
Mastrian,S.D., Miyashita,H., Page,L., Ramakrishna,P., Satoh,S.,
Sattley,W.M., Shimada,Y., Taylor,H.L., Tomo,T., Tsuchiya,T.,
Wang,Z.T., Raymond,J., Mimuro,M., Blankenship,R.E. and
Touchman,J.W.
TITLE Niche adaptation and genome expansion in the chlorophyll
d-producing cyanobacterium Acaryochloris marina
JOURNAL Proc. Natl. Acad. Sci. U.S.A. 105 (6), 2005-2010 (2008)
PUBMED 18252824
REFERENCE 2 (bases 1 to 6503724)
CONSRTM NCBI Genome Project
TITLE Direct Submission
JOURNAL Submitted (17-OCT-2007) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA
REFERENCE 3 (bases 1 to 6503724)
AUTHORS Touchman,J.W.
TITLE Direct Submission
JOURNAL Submitted (27-AUG-2007) Pharmaceutical Genomics Division,
Translational Genomics Research Institute, 13208 E Shea Blvd,
Scottsdale, AZ 85004, USA
COMMENT PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The reference sequence was derived from CP000828.
Source bacteria from Marine Biotechnology Institute Culture
Collection, Marine Biotechnology Institute, 3-75-1 Heita, Kamaishi,
Iwate 026-0001, Japan.
COMPLETENESS: full length.
FEATURES Location/Qualifiers
source 1..6503724
/organism="Acaryochloris marina MBIC11017"
/mol_type="genomic DNA"
/strain="MBIC11017"
/db_xref="taxon:329726"
/note="type strain of Acaryochloris marina"
gene complement(931..1581)
/locus_tag="AM1_0001"
/note="conserved hypothetical protein"
/pseudo
/db_xref="GeneID:5685235"
gene complement(1627..2319)
/locus_tag="AM1_0003"
/db_xref="GeneID:5678840"
CDS complement(1627..2319)
/locus_tag="AM1_0003"
/codon_start=1
/transl_table=11
/product="NUDIX hydrolase"
/protein_id="YP_001514406.1"
/db_xref="GI:158333234"
/db_xref="GeneID:5678840"
/translation="MPYTYDYPRPGLTVDCVVFGLDEQIDLKVLLIQRQIPPFQHQWA
LPGGFVQMDESLEDAARRELREETGVQGIFLEQLYTFGDLGRDPRDRIISVAYYALIN
LIEYPLQASTDAEDAAWYSIENLPSLAFDHAQILKQAIRRLQGKVRYEPIGFELLPQK
FTLTQIQQLYETVLGHPLDKRNFRKKLLKMDLLIPLDEQQTGVAHRAARLYQFDQSKY
ELLKQQGFNFEV"

Does anyone know how I can fix this?

I use the following condition to filter out the two unwanted lines:

if "\tannotation\t" in line or "feature\tsource" in line:

This seems to work for several test .gbk files. But I am still curious why these lines are written in the first place.
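
A slightly stricter variant of that filter could compare the second and third tab-separated columns instead of searching the whole line for a substring. The following is only a minimal sketch, assuming the writer's output is tab-delimited as in standard GFF3; the function name drop_header_rows is hypothetical and not part of BCBio or Biopython:

```python
# Hypothetical helper: drop the "annotation remark" and "feature source"
# rows from an iterable of GFF3 lines, keeping everything else untouched.
UNWANTED = {("annotation", "remark"), ("feature", "source")}

def drop_header_rows(lines):
    for line in lines:
        # Directive and comment lines (e.g. "##gff-version 3") pass through.
        if not line.startswith("#"):
            cols = line.rstrip("\n").split("\t")
            # Column 2 is the GFF "source" field, column 3 the feature type.
            if len(cols) >= 3 and (cols[1], cols[2]) in UNWANTED:
                continue
        yield line
```

It could be applied to the file written by GFF.write, for example by reading that file line by line and writing the surviving lines to a second file.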

Best answer

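The answer is in the wiki page you linked ( http://biopython.org/wiki/GFF_Parsing#Writing_GFF3 ): "The GFF3Writer takes an iterator of SeqRecord objects, and writes each SeqFeature as a GFF3 line". The SeqRecord objects parsed from the .gbk file contain these annotations, so the writer writes them out. In the implementation (https://github.com/chapmanb/bcbb/blob/master/gff/BCBio/GFF/GFFOutput.py) you can see where this is done: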

self._write_annotations(rec.annotations, rec.id, len(rec.seq), out_handle)

You can also see why the source feature is passed through. It is simply a feature like any other (gene, CDS) and is not treated separately.

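I don't know why there is no option or parameter (at least I did not find one) that tells the writer to skip the annotations. I am also not aware of any parameter that skips the annotations when reading SeqRecords with SeqIO.parse().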

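To work around your problem, I accessed the parsed SeqRecords individually and removed the annotations and the source feature. One drawback of this approach is that it needs extra memory (and costs some performance), because I build a list from the initial generator. At the end I simply write the list out as GFF. I don't know whether this approach is much better than yours.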

#!/usr/bin/env python
from BCBio import GFF
from Bio import SeqIO

def convert_to_GFF3():
    in_file = "input.gbk"
    out_file = "output.gff"
    in_handle = open(in_file)
    out_handle = open(out_file, "w")

    records = []
    for record in SeqIO.parse(in_handle, "genbank"):
        # delete annotations
        record.annotations = {}
        # loop through features to find the source
        for i in range(0, len(record.features)):
            # if found, delete it and stop (only expect one source)
            if record.features[i].type == "source":
                record.features.pop(i)
                break
        records.append(record)

    GFF.write(records, out_handle)

    in_handle.close()
    out_handle.close()

convert_to_GFF3()
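
The extra memory mentioned above could probably be avoided by cleaning the records lazily in a generator rather than collecting them in a list. This is only a sketch under assumptions: strip_header is a hypothetical name, and it relies on GFF.write accepting any iterator of records, as it does with SeqIO.parse(...) in the question. Unlike the loop above, it removes every source feature rather than only the first one.

```python
# Hypothetical generator: clean each record as it streams through,
# so that only one record needs to be held in memory at a time.
def strip_header(records):
    for record in records:
        # Drop the record-level annotations (the GenBank header fields).
        record.annotations = {}
        # Keep every feature except those of type "source".
        record.features = [f for f in record.features if f.type != "source"]
        yield record
```

With Biopython and bcbio-gff installed, the list-building loop would then be replaced by a single call along the lines of GFF.write(strip_header(SeqIO.parse(in_handle, "genbank")), out_handle).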

A similar question about "python - BCBio's GFF parser does not parse correctly" can be found on Stack Overflow: https://stackoverflow.com/questions/20190209/
