
python - Why can urlopen download a Google search page but not a Google Scholar search page?


I am using the Python 3.2.3 urllib.request module to download Google search results, but I have run into a problem: urlopen works with links to Google search results, but not with Google Scholar. In this example, I am searching for "JOHN SMITH". This code successfully prints the HTML:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)

But this code, which does the same thing for Google Scholar, raises a URLError exception:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    req_scholar.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)

The traceback:

Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>

I obtained these links by searching in Chrome and copying them from there. One commenter reported a 403 error, which I also receive at times; I assume this is because Google does not support scraping of Scholar. However, changing the user agent string does not fix this or the original problem, since I get URLErrors most of the time.
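
As an aside, the standard request header name is 'User-Agent' (hyphenated), and it can also be passed when constructing the Request; a minimal sketch, separate from the question's own code:

from urllib.request import Request, urlopen

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1'
# Pass the header via the headers argument; note the hyphenated name 'User-Agent'
req = Request('http://www.google.com/', headers={'User-Agent': ua})
print(urlopen(req).read()[0:10])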

Best Answer

This PHP script seems to indicate that you need to set some cookies before Google will give you results:

/*

Need a cookie file (scholar_cookie.txt) like this:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

*/
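
For completeness, a minimal Python sketch (not part of the original answer) of loading a Netscape-format cookie file such as the scholar_cookie.txt described above with http.cookiejar and reusing it for the Scholar request; the file name and cookie contents are assumptions carried over from the PHP comment:

# Minimal sketch: reuse cookies from a Netscape-format cookie file
# (scholar_cookie.txt is an assumed name taken from the PHP comment above).
from http.cookiejar import MozillaCookieJar
from urllib.request import build_opener, HTTPCookieProcessor
from urllib.error import URLError

cookies = MozillaCookieJar('scholar_cookie.txt')
cookies.load(ignore_discard=True, ignore_expires=True)  # read the saved cookies

opener = build_opener(HTTPCookieProcessor(cookies))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) '
                      'Gecko/20120427 Firefox/15.0a1')]

page_scholar = 'http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'
try:
    html_scholar = opener.open(page_scholar).read()  # cookies are sent automatically
    print(html_scholar[0:10])
except URLError as e:
    print(e)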

A comment on this Python recipe for Google Scholar corroborates this; it includes a warning that Google detects scripts and will disable you if you use it too frequently.
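
If you script Scholar queries anyway, a hedged sketch of spacing requests out to reduce the chance of being blocked; the query list and delay values below are arbitrary assumptions, not Google guidance:

import random
import time
from urllib.request import urlopen, Request
from urllib.error import URLError

queries = ['%22JOHN+SMITH%22', '%22JANE+DOE%22']  # hypothetical query list
for q in queries:
    url = 'http://scholar.google.com/scholar?hl=en&q=' + q
    try:
        print(urlopen(Request(url)).read()[0:10])
    except URLError as e:
        print(e)
    time.sleep(random.uniform(5, 15))  # pause between requests to stay polite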

Regarding "python - Why can urlopen download a Google search page but not a Google Scholar search page?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11484250/
