
python-3.x - urlopen returns a redirect error for a valid link

Reposted. Author: 行者123. Last updated: 2023-12-01 09:53:56

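I am building a broken link checker in Python, and writing the logic that correctly identifies links which fail to resolve in a browser is becoming a chore. I have found a set of links where my scraper consistently reproduces a redirect error, yet the same links resolve perfectly when visited in a browser. I am hoping to find some insight here.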

import urllib.error
import urllib.request

# Example URL that reliably reproduces the error
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# Accept-Encoding is left out: urllib does not decompress gzip/deflate
# bodies, so the decode() below would produce garbage.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

output = 'no error'  # overwritten below if the request fails
try:
    req = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    output = format(inst)

print(output)

An example URL that reliably returns this error is "http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html". It resolves perfectly when visited in a browser, but the code above returns the following error:
HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently

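Any ideas on how to correctly identify these links as functional, without blindly ignoring every link from that site (which could miss genuinely broken links)?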

Best answer

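You get the infinite-loop error because the page you are scraping relies on cookies and redirects whenever the client does not send the cookie back. urllib's default opener keeps no cookies, so every redirected request arrives without the cookie and is redirected again. Most other scrapers, and browsers with cookies disabled, hit the same error.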
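To see the mechanism in isolation, here is a minimal sketch that does not touch the real forum. It assumes a hypothetical local server (`CookieGateHandler`, with a made-up cookie named `seen`) that sets a cookie and redirects back to the same URL until the client returns it. The names, the port choice, and the 301 status are illustrative assumptions, not details taken from the site in the question.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGateHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a site that insists on a cookie."""

    def do_GET(self):
        if "seen=1" in self.headers.get("Cookie", ""):
            body = b"page content"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            # No cookie yet: set one and redirect back to the same URL.
            self.send_response(301)
            self.send_header("Set-Cookie", "seen=1")
            self.send_header("Location", self.path)
            self.send_header("Content-Length", "0")
            self.end_headers()

    def log_message(self, *args):
        pass  # silence per-request logging


def reproduce_loop():
    """Fetch from the cookie-gated server without cookies; return the error."""
    server = HTTPServer(("127.0.0.1", 0), CookieGateHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/page" % server.server_port
    # No cookie handler is installed, just like urllib.request.urlopen.
    # The empty ProxyHandler only keeps a system proxy out of the local demo.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        opener.open(url, timeout=5).close()
        return None
    except urllib.error.HTTPError as err:
        return err
    finally:
        server.shutdown()
        server.server_close()


error = reproduce_loop()
print(error)
```

Because the client never returns the cookie, the server keeps answering with the same redirect. urllib gives up after a few repeats of the same URL and should report the "infinite loop" redirect error quoted in the question.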

You need an http.cookiejar.CookieJar and a urllib.request.HTTPCookieProcessor to avoid the redirect loop:

import urllib.error
import urllib.request
from http.cookiejar import CookieJar

# Example URL from the question
url = 'http://forums.hostgator.com/want-see-your-sites-dns-propagating-t48838.html'

# Accept-Encoding is left out: urllib does not decompress gzip/deflate
# bodies, so the decode() below would produce garbage.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux i686; G518Rco3Yp0uLV40Lcc9hAzC1BOROTJADjicLjOmlr4=) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive',
}

try:
    req = urllib.request.Request(url, None, headers)
    # The cookie jar stores cookies set during a redirect and sends them
    # back on the next request, which lets the server stop redirecting.
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    response = opener.open(req)
    raw_response = response.read().decode('utf8', errors='ignore')
    response.close()
except urllib.error.HTTPError as inst:
    output = format(inst)
    print(output)
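For the link checker itself, the cookie-aware opener can be wrapped in a small helper that reports whether a link resolves, so links from a cookie-gated site are checked rather than skipped. This is only a sketch: the function name `check_link`, its `(ok, detail)` return shape, the shortened User-Agent, and the default timeout are illustrative assumptions, not part of the original answer.

```python
import urllib.error
import urllib.request
from http.cookiejar import CookieJar


def check_link(url, timeout=10):
    """Return (ok, detail) for url. Hypothetical helper for illustration."""
    # A fresh jar per check: cookies set during a redirect are sent back
    # on the following request, but nothing leaks between checks.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    # Placeholder User-Agent; substitute whatever your checker sends.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with opener.open(req, timeout=timeout) as response:
            return True, "HTTP %d" % response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses and redirect loops that cookies did not resolve
        return False, format(err)
    except OSError as err:
        # URLError (DNS failure, refused connection, unknown scheme) and
        # lower-level socket errors such as read timeouts
        return False, format(err)
```

A genuinely broken link on the same site still comes back as a failure, for example `(False, 'HTTP Error 404: Not Found')`, while a page that only needed its cookie returned is reported as working.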

Regarding python-3.x - urlopen returns a redirect error for a valid link, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32569934/
