python - Crawling WoS with Python

Repost. Author: 太空宇宙. Updated: 2023-11-03 15:53:12

I am trying to download information from the WoS (Web of Science) database. I need the article title, authors, times cited, volume, and similar details.

This is my code:

import sys 
from BeautifulSoup import BeautifulSoup
import urllib
import re
var = raw_input("Link WoS: ")
conn = urllib.urlopen(var)
html = conn.read()
soup = BeautifulSoup(html)
titles = re.findall('<value lang_id="">(.+?)</value>',str(soup))
volume = re.findall('Volume: </span><span class="data_bold"><value>(.+?)</value>', str(soup))
print(volume)

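It works fine for getting the titles. However, I have trouble getting the following information: volume, issue, pages, publication date, and times cited. This is the page source: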

</span><span name="source_title_1" id="source_title_1">
<value>
<span class="hitHilite">EDUCATIONAL RESEARCH</span>
</value>
</span>&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">
<value>35</value>
</span> &nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">
<value>1</value>
</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">
<value>3-25</value>
</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>SPR 1993</value>
</span>
</div>
<div style="display: inline-block" id="links_1">
<nobr><span id="links_openurl_1"></span> <span id="links_full_text_1"> </span> <span id="links_doc_del_1"></span> <span id="links_patent_1"> </span> </nobr>
</div>
<div class="search-action-item">
<span id="solo_full_text_1" class="solo_full_text"></span><a name="full_text_1" id="full_text_1" title="Full Text" class="button2link button-ft" href="javascript:;"><span id="full_text_1" name="full_text_1" title="Full Text" class="button2 button-ft">Full Text</span></a>
<div class="popup-full-text" id="full_text_1_menu">
<span id="full_text_1_links"></span>
</div>
</div>
<script type="text/javascript">$("#full_text_1").hide();</script><span style="display: inline-block" class="button-abstract" id="ViewAbstract1_text"><a title="View Abstract" alt="View Abstract" onclick="return hide_show_abstract('1', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'View Abstract', 'Close Abstract');" href="javascript:;" class="button9"><img align="absmiddle" title="View Abstract" alt="View Abstract" src="http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif" id="ViewAbstract1_img">View Abstract<nobr></nobr></a></span><span style="display: none" class="button-abstract" id="HideAbstract1_text"><a title="Close Abstract" alt="Close Abstract" onclick="return hide_show_abstract('1', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif', 'View Abstract', 'Close Abstract');" href="javascript:;" class="button9"><img align="absmiddle" title="Close Abstract" alt="Close Abstract" src="http://images.webofknowledge.com/WOKRS523R4/images/spacer.gif" id="HideAbstract1_img">Close Abstract<nobr></nobr></a></span><span style="display: none" url="http://apps.webofknowledge.com/ViewAbstract.do?product=WOS&amp;search_mode=GeneralSearch&amp;viewType=ViewAbstract&amp;qid=5&amp;SID=W1tvVEGCvoimqQujw4V&amp;page=1&amp;doc=1" id="ViewAbstract_Span1">
<!----></span></div><div class="search-results-data">
<div class="search-results-data-cite">Times Cited: <a title="View all of the articles that cite this one" href="/CitingArticles.do?product=WOS&amp;SID=W1tvVEGCvoimqQujw4V&amp;search_mode=CitingArticles&amp;parentProduct=WOS&amp;parentQid=5&amp;parentDoc=1&amp;REFID=448550&amp;excludeEventConfig=ExcludeIfFromNonInterProduct">487</a>
<br>

I think the problem arises because the data are numbers... can you help me?

Best answer

BeautifulSoup's search methods accept regular expressions directly:

import re
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3, as in the question

html = '<html><span>Volume: </span><span class="data_bold"><value>20</value></span></html>'
soup = BeautifulSoup(html)
# Find every text node that contains the label "Volume"
matches = soup.findAll(text=re.compile('Volume'))
for match in matches:
    element = match.parent
    # o/p: <span>Volume: </span>
    sibling_tag = element.findNextSibling()
    # o/p: <span class="data_bold"><value>20</value></span>
    print(sibling_tag.find('value').text)
    # o/p: 20

Note: this is only an example, written without access to the actual HTML.
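
Judging by the page source posted in the question, the original pattern probably fails not because the values are numbers but because there is a line break between `<span class="data_bold">` and `<value>`, which the pattern does not allow for. Below is a minimal sketch using only the standard `re` module, assuming the real page looks like the excerpt in the question. `SAMPLE` is an abridged copy of that excerpt (the `href` is shortened), and `labelled_values` and `times_cited` are hypothetical helper names, not part of any library.

```python
import re

# Abridged copy of the page source shown in the question (href shortened).
SAMPLE = """</span>&nbsp;&nbsp;<span class="label">Volume: </span><span class="data_bold">
<value>35</value>
</span> &nbsp;&nbsp;<span class="label">Issue: </span><span class="data_bold">
<value>1</value>
</span> &nbsp;&nbsp;<span class="label">Pages: </span><span class="data_bold">
<value>3-25</value>
</span> &nbsp;&nbsp;<span class="label">Published: </span><span class="data_bold">
<value>SPR 1993</value>
</span>
<div class="search-results-data-cite">Times Cited: <a title="View all of the articles that cite this one" href="/CitingArticles.do?product=WOS">487</a>
"""

# The pattern from the question: it requires <value> to follow the span
# immediately, so the line break in the source prevents any match.
original = re.findall(
    'Volume: </span><span class="data_bold"><value>(.+?)</value>', SAMPLE)


def labelled_values(html, label):
    """Return the values that follow 'label:' (hypothetical helper).

    \\s* tolerates spaces and line breaks between the tags.
    """
    pattern = (
        re.escape(label)
        + r':\s*</span>\s*<span class="data_bold">\s*<value>\s*(.*?)\s*</value>'
    )
    return re.findall(pattern, html, flags=re.DOTALL)


def times_cited(html):
    """Return the citation counts that follow 'Times Cited:' (hypothetical helper)."""
    return re.findall(r'Times Cited:\s*<a\b[^>]*>\s*(\d+)\s*</a>', html)


record = {label: labelled_values(SAMPLE, label)
          for label in ("Volume", "Issue", "Pages", "Published")}
record["Times Cited"] = times_cited(SAMPLE)

print(original)  # empty list: the strict pattern finds nothing
print(record)
```

Each helper returns a list, so a results page with several records yields one entry per record in document order. Regular expressions remain brittle against markup changes; the sibling-based BeautifulSoup approach above is the more robust choice when the page structure varies.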

Regarding python - Crawling WoS with Python, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41065038/
