python - Preventing 503 errors when scraping Google Scholar

Reposted by: 行者123  Updated: 2023-11-28 19:10:18


I wrote the following code to scrape data from the Google Scholar security page. However, whenever I run it, I get this error:

 Traceback (most recent call last):
  File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 53, in <module>
    getProfileFromTag(each)
  File "/Users/.../Documents/GS_Tag_Scraper/scrape-modified.py", line 32, in getProfileFromTag
    page = urllib.request.urlopen(url)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 504, in error
    result = self._call_chain(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 696, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 472, in open
    response = meth(req, response)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 582, in http_response
    'http', request, response, code, msg, hdrs)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 510, in error
    return self._call_chain(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 444, in _call_chain
    result = func(*args)
  File "/Users/.../anaconda/lib/python3.5/urllib/request.py", line 590, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

I think this is because GS is blocking my requests. How can I avoid this?
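
The traceback passes through http_error_302 before the 503 is raised, so the request was first redirected and it was the redirect target that returned the error. One way to see what the server actually sent is to catch the exception and inspect its status code and headers. A minimal sketch, assuming Python 3's urllib; summarize_http_error and fetch are hypothetical helper names that do not appear in the original script, and a Retry-After header may or may not be present:

```python
import urllib.error
import urllib.request


def summarize_http_error(err):
    """Collect the fields of an HTTPError that help tell throttling from an outage."""
    return {
        'code': err.code,                                # e.g. 503
        'reason': err.reason,                            # e.g. 'Service Unavailable'
        'retry_after': err.headers.get('Retry-After'),   # None when the header is absent
    }


def fetch(url):
    """Return the response body, or None after printing a summary of an HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=30) as page:
            return page.read()
    except urllib.error.HTTPError as err:
        print(summarize_http_error(err))
        return None
```

Calling fetch(url) in place of urllib.request.urlopen(url) keeps the script running after a failed request and shows which status the server returned.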

The code is:

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import urllib.request
import string
import csv
import time

# Declare lists to store the data
name = []
urlList = []

# Open the CSV file and write its header
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
outputWriter.writerow(['Name', 'URL', 'Total Citations', 'h-index', 'i10-index'])

def getStat(url):
    # Given an author's profile URL, return a list of that author's stats.
    url = 'https://scholar.google.pl' + url
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')
    buttons = soup.findAll("td", {"class": "gsc_rsb_std"})
    # Collect the text of each stats cell; the caller reads indexes 0, 2 and 4.
    stats = [button.text for button in buttons]
    return stats

def getProfileFromTag(tag):
    url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:" + tag
    while True:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, 'lxml')

        # Reuse the soup parsed above instead of fetching the same page a second time
        mydivs = soup.findAll("h3", {"class": "gsc_1usr_name"})
        for each in mydivs:
            for anchor in each.find_all('a'):
                name.append(anchor.text)
                urlList.append(anchor['href'])
                time.sleep(0.001)
        buttons = soup.findAll("button", {"aria-label": "Następna"})
        if not buttons:
            break
        on_click = buttons[0].get('onclick')
        url = 'http://scholar.google.pl' + on_click[17:-1]
        url = url.encode('utf-8').decode('unicode_escape')
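    # Look up the stats for each collected author and write one CSV row per author
    for each, profile_url in zip(name, urlList):
        stats = getStat(profile_url)
        outputWriter.writerow([each, profile_url, stats[0], stats[2], stats[4]])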

tags = ['security']
for each in tags:
    getProfileFromTag(each)

Best answer

Use requests instead, with appropriate request headers.

import requests

url = 'https://scholar.google.pl/citations?view_op=search_authors&mauthors=label:security'

request_headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.8',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

with requests.Session() as s:
    r = s.get(url, headers=request_headers)

The result you get:

Adrian Perrig    /citations?user=n-Oret4AAAAJ&hl=pl
Vern Paxson      /citations?user=HvwPRJ0AAAAJ&hl=pl
Frans Kaashoek   /citations?user=YCoLskoAAAAJ&hl=pl
Mihir Bellare    /citations?user=2pW1g5IAAAAJ&hl=pl
Xuemin Shen      /citations?user=Bjl3GwoAAAAJ&hl=pl
Helen J. Wang    /citations?user=qhu-DxwAAAAJ&hl=pl
Sushil Jajodia   /citations?user=lOZ1vHIAAAAJ&hl=pl
Martin Abadi     /citations?user=vWTI60AAAAAJ&hl=pl
Jean-Pierre Hubaux   /citations?user=W7YBLlEAAAAJ&hl=pl
Ross Anderson    /citations?user=WgyDcoUAAAAJ&hl=pl

using this:

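from bs4 import BeautifulSoup

# Assumes r is the response from the request above
soup = BeautifulSoup(r.text, 'lxml')
users = soup.findAll('h3', {'class': 'gsc_oai_name'})
for user in users:
    name = user.a.text.strip()
    link = user.a['href']
    print(name, '\t', link)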
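
Browser-like headers may not be enough on their own when a script requests many pages in quick succession, as the pagination loop in the question does. A common complementary technique, which is not part of the original answer, is to wait between requests and back off when the server answers 503 or 429. A minimal sketch, assuming a requests-style session; get_with_backoff, fetch_label_page, RETRY_STATUSES and the delay values are hypothetical choices for illustration, and nothing here guarantees that Google Scholar will stop blocking the script:

```python
import random
import time

# Status codes treated as "slow down" signals (an assumption for this sketch).
RETRY_STATUSES = {429, 503}


def get_with_backoff(session, url, headers=None, max_tries=4,
                     base_delay=5.0, sleep=time.sleep):
    """GET url, waiting longer after each throttled response.

    session is anything with a requests-style .get(url, headers=..., timeout=...).
    Returns the last response received, whether or not it succeeded.
    """
    response = None
    for attempt in range(max_tries):
        response = session.get(url, headers=headers, timeout=30)
        if response.status_code not in RETRY_STATUSES:
            return response
        if attempt < max_tries - 1:
            # Exponential backoff with jitter: roughly 5 s, 10 s, 20 s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return response


def fetch_label_page(label):
    """Hypothetical usage: fetch the first author-search page for one label."""
    import requests  # third-party; imported here so the helper above has no hard dependency

    url = ('https://scholar.google.pl/citations'
           '?view_op=search_authors&mauthors=label:' + label)
    headers = {
        'accept-language': 'en-US,en;q=0.8',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    }
    with requests.Session() as s:
        return get_with_backoff(s, url, headers=headers)
```

Passing sleep as a parameter keeps the waiting logic easy to exercise without real delays; in normal use the default time.sleep applies.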

You can find the headers your browser sends by inspecting the Network tab of Chrome Developer Tools.
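
Request headers copied from the Network tab usually arrive as plain "name: value" lines, which can be converted into the dict that requests expects. A minimal sketch under that assumption; parse_copied_headers is a hypothetical helper, and the sample text is shortened from the headers shown in the answer:

```python
def parse_copied_headers(raw):
    """Turn 'name: value' lines into a dict with lowercase header names."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority'.
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


copied = """
accept-language: en-US,en;q=0.8
upgrade-insecure-requests: 1
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
request_headers = parse_copied_headers(copied)
```

The resulting request_headers dict can be passed as the headers argument of s.get.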

Regarding python - Preventing 503 errors when scraping Google Scholar, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41331881/
