
python - Parsing a GTF gene file

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:42:03

I have a gene GTF file that I am trying to parse so that "gene_id", "gene_type", "gene_status", "gene_name", and level each end up in a separate column.

So, for my original file:

chr1 |  ENSEMBL gene|   17369|  17436|  .   -   .   |gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 | ENSEMBL gene| 30366| 30503| . + . |gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1 | ENSEMBL gene| 157784| 157887| . - . |gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;
chr1 | ENSEMBL gene| 187891| 187958| . - . |gene_id "ENSG00000273874.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-2"; level 3;

I want it to look like this, with "gene_id", "gene_type", "gene_status", "gene_name", and level each in a separate column:

 chr1   |ENSEMBL    |gene|  17369|  |17436  |.  -   .   |gene_id "ENSG00000278267.1"   |gene_type "miRNA"   |gene_status "KNOWN"   |gene_name "MIR6859-1"   |level 3
chr1 |ENSEMBL |gene| 30366| 30503 |. + . |gene_id "ENSG00000274890.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR1302-2" |level 3
chr1 |ENSEMBL |gene| 157784| 157887 |. - . |gene_id "ENSG00000222623.1" |gene_type "snRNA" |gene_status "KNOWN" |gene_name "RNU6-1100P" |level 3
chr1 |ENSEMBL |gene| 187891| 187958 |. - . |gene_id "ENSG00000273874.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR6859-2" |level 3

I tried to parse it with gffutils, using the basic code they provide:

import gffutils


db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')

print(list(db.featuretypes()))

# Here's how to write genes out to file
with open('sRNA.gene.gtf', 'w') as fout:
    for gene in db.features_of_type('gene'):
        fout.write(str(gene) + '\n')

However, I get "ImportError: cannot import name 'feature:'"

ImportError                               Traceback (most recent call last)
<ipython-input-26-4dd7cd5c7e24> in <module>()
2
3
----> 4 db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
5
6 #db = gffutils.FeatureDB('sRNA.gene.gtf.db')

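I am not sure what went wrong here, and I am now considering parsing the file from the command line instead. Can anyone offer some advice on the best way to parse a GTF file?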

Thanks in advance.

Best answer

You want to change the multiple delimiters in the GTF file to a single tab delimiter. Once that is done, the file is no longer a GTF file.
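As an illustration of that delimiter change, here is a minimal sketch that uses only the Python standard library. It assumes the input is a standard GTF line with nine tab-separated columns (the `|` characters in the question look like a display of the tab boundaries, which is a guess), and the function name `split_gtf_line` is hypothetical. Splitting the attribute column on `;` would break if a quoted value contained a semicolon, so treat this as a starting point rather than a robust parser.

```python
def split_gtf_line(line):
    """Return the 8 fixed GTF columns followed by one column per attribute.

    Assumes a tab-delimited line whose 9th column holds attributes
    such as: gene_id "X"; gene_type "miRNA"; level 3;
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    fixed, attributes = fields[:8], fields[8]
    # Split on ';' and drop the empty piece left by the trailing semicolon.
    attr_columns = [a.strip() for a in attributes.split(";") if a.strip()]
    return fixed + attr_columns


example = (
    "chr1\tENSEMBL\tgene\t17369\t17436\t.\t-\t.\t"
    'gene_id "ENSG00000278267.1"; gene_type "miRNA"; '
    'gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;'
)
columns = split_gtf_line(example)
print("\t".join(columns))
```

Joining the returned list with a tab gives one attribute per column, which matches the layout asked for in the question.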

The code below writes the contents of the GTF file to a text file:

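import gffutils

# create_db() can only be used once per database file,
# so ignore the error it raises on later runs.
try:
    db = gffutils.create_db("sample.gtf", dbfn='sample.db')
except Exception:
    pass
db = gffutils.FeatureDB('sample.db', keep_order=True)
with open('sample.txt', 'w') as fout:
    for feature in db.all_features():
        # Split the attributes on ';' (make your parsing changes here)
        fields = [f.strip() for f in str(feature).split(";") if f.strip()]
        # Write one attribute per tab-separated column
        fout.write('\t'.join(fields) + '\n')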

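Note that you can only use the create_db() method once for a given database file. That is why I originally commented it out.

Edit

Added the try statement.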
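If only the attribute values are wanted in each column, rather than `key "value"` pairs, the attribute field can be turned into a dictionary first. The following is a rough sketch under the assumption that every attribute is either `key "quoted value"` or `key bare_token`, as in the question's sample lines. The names `ATTR_RE`, `parse_attributes`, and `attribute_row` are made up for this example and are not part of gffutils.

```python
import re

# A key followed by either a quoted string or a bare token,
# e.g. gene_id "ENSG00000278267.1" or level 3
ATTR_RE = re.compile(r'([^\s;"]+)\s+(?:"([^"]*)"|([^\s;]+))')


def parse_attributes(attr_field):
    """Map each attribute key to its value, with surrounding quotes removed."""
    attrs = {}
    for match in ATTR_RE.finditer(attr_field):
        key, quoted, bare = match.groups()
        attrs[key] = quoted if quoted is not None else bare
    return attrs


def attribute_row(attr_field, keys):
    """Return the values for `keys` in order, using '' for missing keys."""
    attrs = parse_attributes(attr_field)
    return [attrs.get(key, "") for key in keys]


keys = ["gene_id", "gene_type", "gene_status", "gene_name", "level"]
field = ('gene_id "ENSG00000278267.1"; gene_type "miRNA"; '
         'gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;')
print("\t".join(keys))
print("\t".join(attribute_row(field, keys)))
```

Using a fixed list of keys keeps the columns aligned across rows even when a feature lacks one of the attributes.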

Regarding python - Parsing a GTF gene file, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33585332/
