
python - How to access PubMed data spread across multiple pages in a Python crawler

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 01:27:34

I am trying to scrape PubMed with Python and get the PubMed IDs of all papers that cite a given article.

For example, this article (ID: 11825149) http://www.ncbi.nlm.nih.gov/pubmed/11825149 has a page linking to all the articles that cite it: http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=11825149 . The problem is that there are over 200 links, but only 20 are shown per page, and the "next page" link cannot be reached through the URL.

Is there a way in Python to open the "Send to" option, or to view the contents of the next pages?

How I currently open the PubMed pages:

from urllib.request import urlopen  # Python 3; the original used Python 2's urllib and print statements

def start(seed):
    webpage = urlopen(seed).read()
    print(webpage)

# pageid is the article's PubMed ID; note the "?" after /pubmed,
# which was missing from the original URL:
citedByPage = urlopen('http://www.ncbi.nlm.nih.gov/pubmed?linkname=pubmed_pubmed_citedin&from_uid=' + pageid).read()
print(citedByPage)

From here I can extract all the cited-by links on the first page, but how do I extract them from all of the pages? Thanks.
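For what it's worth, the pagination on the web page can be sidestepped entirely by calling NCBI's ELink E-utility directly, which returns every citing ID in one XML response. A minimal standard-library sketch, assuming the `pubmed_pubmed_citedin` link name from the question's own URL; the `SAMPLE` XML below is a hand-made illustration of the response shape, not real data:

```python
from urllib.request import urlopen
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def citedin_url(pmid):
    # Build the ELink request for all PubMed articles citing `pmid`.
    params = {"dbfrom": "pubmed", "linkname": "pubmed_pubmed_citedin", "id": pmid}
    return EUTILS + "?" + urlencode(params)

def extract_ids(xml_text):
    # Pull every <Link><Id> value out of an ELink XML response.
    root = ET.fromstring(xml_text)
    return [id_elem.text for id_elem in root.findall(".//Link/Id")]

def fetch_citedin(pmid):
    # Network call -- returns the full list, with no 20-per-page limit.
    with urlopen(citedin_url(pmid)) as resp:
        return extract_ids(resp.read())

# Hand-made sample of the response shape, for illustration only:
SAMPLE = """<eLinkResult><LinkSet><IdList><Id>11825149</Id></IdList>
<LinkSetDb><DbTo>pubmed</DbTo><LinkName>pubmed_pubmed_citedin</LinkName>
<Link><Id>19304878</Id></Link><Link><Id>15985178</Id></Link>
</LinkSetDb></LinkSet></eLinkResult>"""

print(extract_ids(SAMPLE))  # ['19304878', '15985178']
```

Whether `pubmed_pubmed_citedin` covers the same citation set as the web page's "Cited by" list is worth verifying against the page itself; it is the link name the question's URL uses.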

Best Answer

I was able to get the citing IDs using the method from this page: http://www.bio-cloud.info/Biopython/en/ch8.html

Back in Section 8.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let’s try this for the Biopython PDB parser paper, PubMed ID 14630660:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
... LinkName="pubmed_pmc_refs", from_uid=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']

Great - eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.

So, what if (like me) you'd rather get back a list of PubMed IDs? Well, we can call ELink again to translate them. This becomes a two step process, so by now you should expect to use the history feature to accomplish it (Section 8.15).

But first, taking the more straightforward approach of making a second (separate) call to ELink:

>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
... from_uid=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']

This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).

Now, let’s do that all again but with the history …TODO.
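The history-based variant the tutorial leaves as TODO would, as I understand the E-utilities (this is my own sketch, not verified against the tutorial), run ELink with `cmd=neighbor_history` to store the PMC hits on NCBI's history server, then reference them via the returned `WebEnv` and `query_key` in the second call. A sketch of just the request construction, using the standard library:

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def first_call_url(pmid, email):
    # Step 1: find PMC citing articles and keep them on the history server.
    params = {
        "dbfrom": "pubmed", "db": "pmc", "linkname": "pubmed_pmc_refs",
        "id": pmid, "cmd": "neighbor_history", "email": email,
    }
    return BASE + "?" + urlencode(params)

def second_call_url(webenv, query_key, email):
    # Step 2: translate the stored PMC IDs back to PubMed IDs.
    # `webenv` and `query_key` are parsed from step 1's XML response.
    params = {
        "dbfrom": "pmc", "db": "pubmed", "linkname": "pmc_pubmed",
        "WebEnv": webenv, "query_key": query_key, "email": email,
    }
    return BASE + "?" + urlencode(params)

print(first_call_url("14630660", "A.N.Other@example.com"))
```

With Biopython, the same two steps would go through `Entrez.elink`; the exact keys under which the history token appears in the parsed result should be checked against the Entrez documentation.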

And finally, don’t forget to include your own email address in the Entrez calls.

Regarding "python - How to access PubMed data spread across multiple pages in a Python crawler", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16746410/
