
python - Why can urlopen download a Google search page but not a Google Scholar search page?


I am using the Python 3.2.3 urllib.request module to download Google search results, but I have run into a problem: urlopen works with links to Google search results, but not with Google Scholar. In this example, I am searching for "JOHN SMITH". This code successfully prints the HTML:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google
try:
    page_google = '''http://www.google.com/#hl=en&sclient=psy-ab&q=%22JOHN+SMITH%22&oq=%22JOHN+SMITH%22&gs_l=hp.3..0l4.129.2348.0.2492.12.10.0.0.0.0.154.890.6j3.9.0...0.0...1c.gjDBcVcGXaw&pbx=1&bav=on.2,or.r_gc.r_pw.r_qf.,cf.osb&fp=dffb3b4a4179ca7c&biw=1366&bih=649'''
    req_google = Request(page_google)
    req_google.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_google = urlopen(req_google).read()
    print(html_google[0:10])
except URLError as e:
    print(e)

But this code, which does the same thing for Google Scholar, raises a URLError exception:

from urllib.request import urlopen, Request
from urllib.error import URLError

# Google Scholar
try:
    page_scholar = '''http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'''
    req_scholar = Request(page_scholar)
    req_scholar.add_header('User Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1')
    html_scholar = urlopen(req_scholar).read()
    print(html_scholar[0:10])
except URLError as e:
    print(e)

The traceback:

Traceback (most recent call last):
  File "/home/ak5791/Desktop/code-sandbox/scholar/crawler.py", line 6, in <module>
    html = urlopen(page).read()
  File "/usr/lib/python3.2/urllib/request.py", line 138, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.2/urllib/request.py", line 369, in open
    response = self._open(req, data)
  File "/usr/lib/python3.2/urllib/request.py", line 387, in _open
    '_open', req)
  File "/usr/lib/python3.2/urllib/request.py", line 347, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.2/urllib/request.py", line 1155, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/lib/python3.2/urllib/request.py", line 1138, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -5] No address associated with hostname>

I obtained these links by searching in Chrome and copying them from there. One commenter reported a 403 error, which I also receive at times; I assume this is because Google does not support scraping of Scholar. However, changing the user agent string does not fix this or the original problem, since I get URLErrors most of the time.
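
As an aside, the standard request header name is 'User-Agent' (hyphenated), and it can also be passed when constructing the Request; a minimal sketch, separate from the question's own code:

from urllib.request import Request, urlopen

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20120427 Firefox/15.0a1'
# Pass the header via the headers argument; note the hyphenated name 'User-Agent'
req = Request('http://www.google.com/', headers={'User-Agent': ua})
print(urlopen(req).read()[0:10])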

Best Answer

This PHP script seems to indicate that you need to set some cookies before Google will give you results:

/*

Need a cookie file (scholar_cookie.txt) like this:

# Netscape HTTP Cookie File
# http://curlm.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

.scholar.google.com TRUE / FALSE 2147483647 GSP ID=353e8f974d766dcd:CF=2
.google.com TRUE / FALSE 1317124758 PREF ID=353e8f974d766dcd:TM=1254052758:LM=1254052758:S=_biVh02e4scrJT1H
.scholar.google.co.uk TRUE / FALSE 2147483647 GSP ID=f3f18b3b5a7c2647:CF=2
.google.co.uk TRUE / FALSE 1317125123 PREF ID=f3f18b3b5a7c2647:TM=1254053123:LM=1254053123:S=UqjRcTObh7_sARkN

*/
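
For completeness, a minimal Python sketch (not part of the original answer) of loading a Netscape-format cookie file such as the scholar_cookie.txt described above with http.cookiejar and reusing it for the Scholar request; the file name and cookie contents are assumptions carried over from the PHP comment:

# Minimal sketch: reuse cookies from a Netscape-format cookie file
# (scholar_cookie.txt is an assumed name taken from the PHP comment above).
from http.cookiejar import MozillaCookieJar
from urllib.request import build_opener, HTTPCookieProcessor
from urllib.error import URLError

cookies = MozillaCookieJar('scholar_cookie.txt')
cookies.load(ignore_discard=True, ignore_expires=True)  # read the saved cookies

opener = build_opener(HTTPCookieProcessor(cookies))
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) '
                      'Gecko/20120427 Firefox/15.0a1')]

page_scholar = 'http://scholar.google.com/scholar?hl=en&q=%22JOHN+SMITH%22&btnG=&as_sdt=1%2C14'
try:
    html_scholar = opener.open(page_scholar).read()  # cookies are sent automatically
    print(html_scholar[0:10])
except URLError as e:
    print(e)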

A comment on this Python recipe for Google Scholar corroborates this; it includes a warning that Google detects scripts and will disable you if you use it too frequently.
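
If you script Scholar queries anyway, a hedged sketch of spacing requests out to reduce the chance of being blocked; the query list and delay values below are arbitrary assumptions, not Google guidance:

import random
import time
from urllib.request import urlopen, Request
from urllib.error import URLError

queries = ['%22JOHN+SMITH%22', '%22JANE+DOE%22']  # hypothetical query list
for q in queries:
    url = 'http://scholar.google.com/scholar?hl=en&q=' + q
    try:
        print(urlopen(Request(url)).read()[0:10])
    except URLError as e:
        print(e)
    time.sleep(random.uniform(5, 15))  # pause between requests to stay polite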

Regarding "python - Why can urlopen download a Google search page but not a Google Scholar search page?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11484250/
