python - Entrez epost + elink returns results out of order with Biopython

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:11:56

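I ran into this today and wanted to put it out there. With Biopython's interface to NCBI Entrez, it does not seem possible to get results back (at least from elink) in the correct order, that is, the same order as the input. See the code example below. I have thousands of GI numbers for which I need taxonomy information, and querying them one at a time is painfully slow because of NCBI's restrictions.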

from Bio import Entrez

Entrez.email = "my@email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]

# Batch: post all IDs at once, then link the whole set to taxonomy.
search_results = Entrez.read(Entrez.epost("protein", id=",".join(ids)))
webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"]
print(Entrez.read(Entrez.elink(webenv=webenv,
                               query_key=query_key,
                               dbfrom="protein",
                               db="taxonomy")))

print("-------")

# One at a time: post and link each ID separately.
for i in ids:
    search_results = Entrez.read(Entrez.epost("protein", id=i))
    webenv = search_results["WebEnv"]
    query_key = search_results["QueryKey"]
    print(Entrez.read(Entrez.elink(webenv=webenv,
                                   query_key=query_key,
                                   dbfrom="protein",
                                   db="taxonomy")))

Result:

[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}, {u'Id': '81972'}, {u'Id': '32630'}, {u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191', '297793721', '48525513', '507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
-------
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '3332'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['148908191'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '81972'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['297793721'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '211604'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['48525513'], u'LinkSetDbHistory': [], u'ERROR': []}]
[{u'LinkSetDb': [{u'DbTo': 'taxonomy', u'Link': [{u'Id': '32630'}], u'LinkName': 'protein_taxonomy'}], u'DbFrom': 'protein', u'IdList': ['507118461'], u'LinkSetDbHistory': [], u'ERROR': []}]
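Putting the two outputs side by side makes the problem concrete. The sketch below copies the taxonomy IDs from the output above into plain lists (the variable names are illustrative and not part of Biopython) and checks that the batch call returns the same set of taxa, just not in input order.

```python
# Protein IDs in the order they were posted.
ids = ["148908191", "297793721", "48525513", "507118461"]

# Taxonomy IDs copied from the batch elink output above.
batch_links = ["211604", "81972", "32630", "3332"]

# Taxonomy IDs copied from the four single-ID outputs above, in input order.
per_id_links = ["3332", "81972", "211604", "32630"]

# Same taxa overall, but the batch order does not follow the input order,
# so zipping `ids` with `batch_links` would pair proteins with wrong taxa.
same_set = sorted(batch_links) == sorted(per_id_links)
same_order = batch_links == per_id_links
wrong_pairs = [(protein, got, want)
               for protein, got, want in zip(ids, batch_links, per_id_links)
               if got != want]

print(same_set, same_order)
print(wrong_pairs)
```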

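NCBI's elink documentation ( http://www.ncbi.nlm.nih.gov/books/NBK25499/ ) says this should be possible, but only by passing multiple "id=" parameters, and that does not seem possible with Biopython's epost interface. Has anyone else seen this, or am I missing something obvious?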
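For reference, the "multiple id=" form is a query string in which the `id` parameter is repeated once per UID instead of being given once as a comma-separated list. The sketch below builds such a URL with the standard library only and makes no network request. `build_elink_url` and `pair_links` are hypothetical helper names, not Biopython functions, and `pair_links` assumes the parsed response contains one link set per input ID, shaped like the single-ID results shown above. Recent Biopython releases document that passing a list as `id` to `Entrez.elink` has the same effect, but that is worth verifying against the installed version.

```python
from urllib.parse import urlencode

# Public E-utilities elink endpoint.
ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_elink_url(ids, dbfrom="protein", db="taxonomy"):
    """Return an elink URL that repeats id= once per UID."""
    params = {"dbfrom": dbfrom, "db": db, "id": list(ids)}
    # doseq=True expands the list into id=...&id=... instead of one value.
    return ELINK_URL + "?" + urlencode(params, doseq=True)


def pair_links(linksets):
    """Map each source ID to its linked IDs, one link set per source ID."""
    mapping = {}
    for linkset in linksets:
        source_id = linkset["IdList"][0]
        mapping[source_id] = [link["Id"]
                              for linkdb in linkset["LinkSetDb"]
                              for link in linkdb["Link"]]
    return mapping


ids = ["148908191", "297793721", "48525513", "507118461"]
url = build_elink_url(ids)
print(url)

# Two link sets in the shape of the single-ID results shown above.
example = [
    {"IdList": ["148908191"], "LinkSetDb": [{"Link": [{"Id": "3332"}]}]},
    {"IdList": ["297793721"], "LinkSetDb": [{"Link": [{"Id": "81972"}]}]},
]
print(pair_links(example))
```

Because each link set carries its own source ID, the mapping can be looked up by input ID regardless of the order in which the server returns the sets.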

Thanks!

Best answer

from Bio import Entrez

Entrez.email = "my@email.com"
ids = ["148908191", "297793721", "48525513", "507118461"]
search_results = Entrez.read(Entrez.epost("protein", id=",".join(ids)))

# Fetch the posted records as GenPept XML via the history server.
xml = Entrez.efetch("protein",
                    query_key=search_results["QueryKey"],
                    WebEnv=search_results["WebEnv"],
                    rettype="gp",
                    retmode="xml")

for record in Entrez.read(xml):
    # The GI is listed among the other seqids as "gi|<number>"; strip the prefix.
    print([x[3:] for x in record["GBSeq_other-seqids"] if x.startswith("gi")])
    # Look for db_xref qualifiers on the first feature of the record.
    gb_quals = record["GBSeq_feature-table"][0]["GBFeature_quals"]
    for qualifier in gb_quals:
        if qualifier["GBQualifier_name"] == "db_xref":
            print(qualifier["GBQualifier_value"])

    # Or with a list comprehension:
    # print([q["GBQualifier_value"] for q in
    #        record["GBSeq_feature-table"][0]["GBFeature_quals"] if
    #        q["GBQualifier_name"] == "db_xref"])

xml.close()

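Run an efetch query, then parse the XML after reading it with Entrez.read(). This is where things get messy: you have to dig into the nested dicts and lists that come out of the XML. I suspect there is a better way than mine to extract the "GBFeature_quals" entries whose "GBQualifier_name" is "db_xref", but this works (so far). Output: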

['148908191']
taxon:3332

['297793721']
taxon:81972

['48525513']
taxon:211604

['507118461']
taxon:32630
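The digging through the parsed XML can be wrapped in a small helper that returns a GI-to-taxon mapping, so the result no longer depends on the order in which records come back. This is a minimal sketch and not part of Biopython: `gi_to_taxon` is a hypothetical name, it assumes records shaped like those used in the code above (a `GBSeq_other-seqids` list and a first feature carrying a `db_xref` qualifier), and the stand-in record is hand-written from the output above rather than fetched from NCBI.

```python
def gi_to_taxon(records):
    """Map each GI to its taxon ID, given records parsed by Entrez.read()."""
    mapping = {}
    for record in records:
        # Seqids look like "gi|148908191"; keep only the part after "gi|".
        gis = [s[3:] for s in record["GBSeq_other-seqids"] if s.startswith("gi|")]
        quals = record["GBSeq_feature-table"][0]["GBFeature_quals"]
        # db_xref values look like "taxon:3332"; keep only taxon entries.
        taxa = [q["GBQualifier_value"].split(":", 1)[1]
                for q in quals
                if q["GBQualifier_name"] == "db_xref"
                and q["GBQualifier_value"].startswith("taxon:")]
        for gi in gis:
            mapping[gi] = taxa[0] if taxa else None
    return mapping


# Hand-written stand-in for one parsed record; the GI and taxon come from the
# output above, the other values are placeholders.
sample = [{
    "GBSeq_other-seqids": ["gb|EXAMPLE.1|", "gi|148908191"],
    "GBSeq_feature-table": [{
        "GBFeature_quals": [
            {"GBQualifier_name": "organism", "GBQualifier_value": "example organism"},
            {"GBQualifier_name": "db_xref", "GBQualifier_value": "taxon:3332"},
        ],
    }],
}]

ids = ["148908191"]
mapping = gi_to_taxon(sample)
# Look results up by input ID to restore the input order.
print([(gi, mapping.get(gi)) for gi in ids])
```

With real data, the list returned by `Entrez.read(xml)` would take the place of `sample`.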

For more on "python - Entrez epost + elink returns results out of order with Biopython", see the similar question on Stack Overflow: https://stackoverflow.com/questions/25775309/
