
python - How do I fix these errors caused by the server blocking web scraping?

Reposted. Author: 行者123. Updated: 2023-12-01 09:25:26

I am trying to use the "get_text" function to extract the text from a web page, as described here.

import urllib.request
from inscriptis import get_text

url = "http://www.informationscience.ch"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This works fine for this particular site, but when I try to scrape a different site I get a 403 error:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url).read().decode('utf-8')

text = get_text(html)

print(text)

This raises the following error on the line html = urllib.request.urlopen(url).read().decode('utf-8'):

HTTPError: HTTP Error 403: Forbidden

I tried to fix it by specifying a user agent, as follows:

import urllib.request
from inscriptis import get_text

url = "https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms"
html = urllib.request.urlopen(url, headers={'User-Agent': 'Mozilla/5.0'}).read().decode('utf-8')

text = get_text(html)

print(text)

But I get the following error:

TypeError: urlopen() got an unexpected keyword argument 'headers'

Since the error says that urlopen does not accept a headers argument, I tried specifying the user agent with the requests module instead, as follows:

from inscriptis import get_text
import requests
url = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', "lxml", headers={'User-Agent': 'Mozilla/5.0'})

print(get_text(url))

But this produces the following error:

AttributeError: 'Response' object has no attribute 'strip'

How do I get this damn server to stop blocking my web scraping?

Best answer

You need to pass the response body to get_text, not the response object itself:

import requests
from inscriptis import get_text

response = requests.get('https://economictimes.indiatimes.com/markets/stocks/news/birla-group-enters-the-fray-to-acquire-idbi-federal-life/articleshow/64251332.cms', headers={'User-Agent': 'Mozilla/5.0'})  # "lxml" dropped: requests.get would send it as a query string

print(get_text(response.text))
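
The TypeError in the question arises because urlopen() itself takes no headers keyword; with the standard library, headers are attached to a urllib.request.Request object, which is then passed to urlopen(). The following is a minimal sketch of that route, not part of the original answer. The helper names build_request and fetch_html, the 10-second timeout, and the assumption that the page is UTF-8 encoded are all illustrative, and a browser-like User-Agent is not guaranteed to satisfy every server.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Hypothetical helper: wrap the URL in a Request that carries a User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, user_agent="Mozilla/5.0", timeout=10):
    """Hypothetical helper: download the page and decode it (assumes UTF-8 content)."""
    request = build_request(url, user_agent)
    # urlopen accepts a Request object in place of a plain URL string.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Possible usage with inscriptis (requires network access):
#   from inscriptis import get_text
#   print(get_text(fetch_html("http://www.informationscience.ch")))
```

If the server still answers 403, urlopen raises urllib.error.HTTPError as before, so the header only helps when the block is based on the default Python user agent.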

Regarding "python - How do I fix these errors caused by the server blocking web scraping?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50444289/
