
python - UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3

Reposted. Author: 太空宇宙. Last updated: 2023-11-03 14:24:29

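I am writing a script that visits a list of links and parses information from each page.

It works for most sites, but on some it chokes with "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)".

It stops in client.py (http.client, per the traceback), which urllib uses in Python 3.

The exact link is: http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

There are quite a few similar posts here, but none of the answers seem to work for me.

My code is: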

import socket
from urllib import request
from urllib.error import HTTPError, URLError

def __request(link, debug=0):
    """Fetch link and return the page decoded as text, or '' on failure."""
    try:
        # Long timeout because I was getting lots of timeouts.
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')

    # NOTE the except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print('The server couldn\'t fulfill the request for ' + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        return ''
    else:
        return unicode_html

Calling the request function:

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
page = __request(link)

The traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)

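Any help is appreciated. This is driving me crazy; I think I have tried every combination of x.decode and similar.

(I could ignore the offending characters, if that is possible.)

Best answer

Use a percent-encoded URL: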

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'
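For context on why this helps: the last frame of the traceback shows http.client encoding the request line with the 'ascii' codec. The following is a minimal sketch that reproduces the failure without any network access. The `try_ascii` helper and the exact request-line format are assumptions made for illustration, not the library's actual code.

```python
# '\xe9' is the character 'é' from the original link.
raw_path = '/news/caf\xe9s-growing-faster-than-fast-food-peers-144512056.html'
encoded_path = '/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

def try_ascii(path):
    """Hypothetical helper: return the offset of the first non-ASCII
    character in an HTTP request line built from path, or None if the
    whole line is ASCII-encodable."""
    request_line = 'GET {} HTTP/1.1'.format(path)
    try:
        request_line.encode('ascii')
    except UnicodeEncodeError as exc:
        return exc.start
    return None

print(try_ascii(raw_path))      # 13, matching "position 13" in the error
print(try_ascii(encoded_path))  # None, the percent-encoded path is pure ASCII
```

The offset 13 counts from the start of the request line ("GET " plus "/news/caf"), which is why the error message points at position 13 rather than at a position within the full URL.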

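I found the above percent-encoded URL by pointing the browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

going to the page, and then copying the encoded URL supplied by the browser back into the text editor. However, you can also generate a percent-encoded URL programmatically with the following code: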

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
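When the links come from a list, some of them may already be percent-encoded, and quoting those a second time would turn `%C3` into `%25C3`. The sketch below wraps the same urlsplit/quote/urlunsplit steps in a helper that leaves existing `%` escapes alone and also quotes the query string. The name `percent_encode_url` and the chosen `safe` sets are assumptions for illustration, not part of urllib.

```python
from urllib import parse

def percent_encode_url(link):
    """Hypothetical helper: percent-encode non-ASCII and unsafe characters
    in the path and query of link.

    '%' is kept in the safe set so that URLs which are already encoded are
    returned unchanged. The trade-off is that a literal '%' in the input is
    not escaped to '%25'.
    """
    scheme, netloc, path, query, fragment = parse.urlsplit(link)
    path = parse.quote(path, safe='/%')
    query = parse.quote(query, safe='=&%+')
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

link = 'http://finance.yahoo.com/news/caf\xe9s-growing-faster-than-fast-food-peers-144512056.html'
encoded = percent_encode_url(link)
print(encoded)
# Applying the helper again leaves the URL unchanged.
print(percent_encode_url(encoded) == encoded)
```

The result can then be passed to `request.urlopen` in place of the raw link. This sketch does not touch the host name; a non-ASCII host would need separate handling.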

This question (python - UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3) comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/22734464/
