
python - How to download complete genome sequences with Biopython Entrez.esearch

Reposted. Author: 行者123. Updated: 2023-11-28 17:45:45

I need to download complete genome sequences from NCBI in GenBank (full) format. I am interested in "complete genome" records, not "whole genome" records.

My script:

from Bio import Entrez
Entrez.email = "asiakXX@wp.pl"
gatunek='Escherichia[ORGN]'
handle = Entrez.esearch(db='nucleotide',
term=gatunek, property='complete genome' )#title='complete genome[title]')
result = Entrez.read(handle)

As a result I only get small fragments of genomes, about 484 bp in size:

LOCUS       NZ_KE350773              484 bp    DNA     linear   CON 23-AUG-2013
DEFINITION Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome
shotgun sequence.

I know how to do this manually through the NCBI website, but it is very time-consuming. The query I use there:

escherichia[orgn] AND complete genome[title]

That returns multiple genomes of around 5,154,862 bp in size, which is what I need to reproduce with Entrez.esearch.
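As a hedged sketch, the website query quoted in the question can be passed to Entrez.esearch as the term string. The helper names build_term and search_ids, the placeholder e-mail address, and the retmax value of 20 are assumptions made for illustration; they do not come from the original post.

```python
def build_term(organism, title_phrase):
    """Build an Entrez query such as 'escherichia[orgn] AND complete genome[title]'."""
    return "{}[orgn] AND {}[title]".format(organism, title_phrase)


def search_ids(term, email, retmax=20):
    """Return the list of record IDs that esearch reports for `term`."""
    # Imported here so build_term stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks every client to identify itself
    handle = Entrez.esearch(db="nucleotide", term=term, retmax=retmax)
    result = Entrez.read(handle)
    handle.close()
    return result["IdList"]


term = build_term("escherichia", "complete genome")
print(term)  # escherichia[orgn] AND complete genome[title]

# To run the search itself (needs network access and Biopython):
# ids = search_ids(term, "you@example.org")  # hypothetical address
```

The field tags go inside the term string, in the same form the website accepts. Whether the hits match what the website shows still has to be checked against the returned records.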

Best answer

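Your question is clear, but the full answer is long. The code I provide generates one .fasta file for each E. coli genome sequence you need, and yes, only for the "Complete Genomes" in NCBI.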

You will see that NCBI lists only six complete E. coli reference genomes (http://www.ncbi.nlm.nih.gov/genome/167):


To help you, here are the GenBank/RefSeq links to their genomes:

  1. http://www.ncbi.nlm.nih.gov/nuccore/NC_000913.3

  2. http://www.ncbi.nlm.nih.gov/nuccore/NC_002695.1

  3. http://www.ncbi.nlm.nih.gov/nuccore/NC_011750.1

  4. http://www.ncbi.nlm.nih.gov/nuccore/NC_011751.1

  5. http://www.ncbi.nlm.nih.gov/nuccore/NC_017634.1

  6. http://www.ncbi.nlm.nih.gov/nuccore/NC_018658.1

Here is my code for parsing the complete genome sequences into .fasta files...

# Imports
from Bio import Entrez
from Bio import SeqIO

#############################
# Retrieve NCBI Data Online #
#############################

Entrez.email = "asiak@wp.pl" # Always tell NCBI who you are
genomeAccessions = ['NC_000913', 'NC_002695', 'NC_011750', 'NC_011751', 'NC_017634', 'NC_018658']
search = " OR ".join(acc + "[Accession]" for acc in genomeAccessions) # match any one of the accessions
searchHandle = Entrez.esearch(db="nucleotide", term=search, retmode="xml")
searchResult = Entrez.read(searchHandle)
searchHandle.close()
genomeIds = searchResult['IdList']
records = Entrez.efetch(db="nucleotide", id=genomeIds, rettype="gb", retmode="text")

###############################
# Generate Genome Fasta files #
###############################

sequences = [] # store your sequences in a list
headers = [] # store genome names in a list (db_xref ids)

# SeqIO.parse yields one record per genome; iterating over the handle itself would yield single lines
for i, genome in enumerate(SeqIO.parse(records, "genbank")):

    SeqIO.write(genome, "genBankRecord_"+str(i)+".gb", "genbank") # store each genome's .gb in a separate file

    header = genome.features[0].qualifiers['db_xref'][0] # name the genome using its db_xref ID
    sequence = str(genome.seq) # obtain genome sequence (Seq.tostring() no longer exists in current Biopython)

    headers.append('>'+header) # store genome name in list
    sequences.append(sequence) # store sequence in list

    fasta_out = open("genome"+str(i)+".fasta", "w") # store each genome's .fasta in a separate file
    fasta_out.write('>'+header+'\n') # >header ... followed by:
    fasta_out.write(sequence+'\n') # sequence ...
    fasta_out.close() # close that .fasta file and move on to next genome

records.close()
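
The question shows a 484 bp "whole genome shotgun" scaffold coming back instead of a complete genome. A minimal sketch of a sanity check on each fetched record might look like the following. The function name looks_complete and the 1,000,000 bp threshold are assumptions chosen for illustration, and the second description string is a made-up example, not a real NCBI record.

```python
def looks_complete(description, length, min_length=1000000):
    """Heuristic: True if a record looks like a complete genome, not a WGS scaffold."""
    # Normalise case and collapse the line wrapping found in GenBank DEFINITION fields.
    text = " ".join(description.lower().split())
    if "whole genome shotgun" in text:
        return False
    # min_length is an assumed cut-off; adjust it for the organism of interest.
    return "complete genome" in text and length >= min_length


# The 484 bp scaffold quoted in the question is rejected:
scaffold = ("Escherichia coli E1777 genomic scaffold scaffold9_G, whole genome\n"
            "shotgun sequence.")
print(looks_complete(scaffold, 484))  # False

# A hypothetical description of the kind a complete genome record carries:
print(looks_complete("Escherichia coli example strain, complete genome", 5154862))  # True
```

With Biopython records, the same check could be called as looks_complete(genome.description, len(genome)) before anything is written to disk.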

Let me know how it goes! Andy

Regarding "python - How to download complete genome sequences with Biopython Entrez.esearch", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18461629/
