
python - UnicodeError: URL contains non-ASCII characters (Python 2.7)

Reposted. Author: 行者123. Last updated: 2023-11-28 21:12:55

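I managed to build a crawler that follows every link on the site. When it reaches a product link, it does some lookups and collects all the product information. On certain pages, however, it raises a Unicode error: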

import urllib
import urlparse
from itertools import ifilterfalse
from urllib2 import URLError, HTTPError

from bs4 import BeautifulSoup

urls = ["http://www.kiabi.es/"]
visited = []


def get_html_text(url):
    try:
        return urllib.urlopen(current_url).read()
    except (URLError, HTTPError, urllib.ContentTooShortError):
        print "Error getting " + current_url


def find_internal_links_in_html_text(html_text, base_url):
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for tag in soup.findAll('a', href=True):
        url = urlparse.urljoin(base_url, tag['href'])
        domain = urlparse.urlparse(base_url).hostname
        if domain in url:
            links.append(url)
    return links


def is_url_already_visited(url):
    return url in visited


while urls:
    current_url = urls.pop()
    word = '#C'
    if word in current_url:
        pass  # [do sth]
    #print "Parsing", current_url
    html_text = get_html_text(current_url)
    visited.append(current_url)
    found_urls = find_internal_links_in_html_text(html_text, current_url)
    new_urls = ifilterfalse(is_url_already_visited, found_urls)
    urls.extend(new_urls)

Error:

Traceback (most recent call last):

File "<ipython-input-1-67c2b4cf7175>", line 1, in <module>
runfile('S:/Consultas_python/Kiabi.py', wdir='S:/Consultas_python')

File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
execfile(filename, namespace)

File "C:\Anaconda2\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 71, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)

File "S:/Consultas_python/Kiabi.py", line 91, in <module>
html_text = get_html_text(current_url)

File "S:/Consultas_python/Kiabi.py", line 30, in get_html_text
return urllib.urlopen(current_url).read()

File "C:\Anaconda2\lib\urllib.py", line 87, in urlopen
return opener.open(url)

File "C:\Anaconda2\lib\urllib.py", line 185, in open
fullurl = unwrap(toBytes(fullurl))

File "C:\Anaconda2\lib\urllib.py", line 1070, in toBytes
" contains non-ASCII characters")

UnicodeError: URL u'http://www.kiabi.es/Barbapap\xe1_s1' contains non-ASCII characters

UnicodeError: URL u'http://www.kiabi.es/Petit-B\xe9guin_s2' contains non-ASCII characters

How can I fix this?

Best answer

You need to percent-encode the UTF-8 representation of the Unicode string.

As explained here:

All non-ASCII code points in the IRI should next be encoded as UTF-8, and the resulting bytes percent-encoded, to produce a valid URI.

In Python code, this means:

import urllib
url = urllib.quote(url.encode('utf8'), ':/')

The second argument to quote, ':/', keeps the colon in the protocol part (http:) and the path separator / from being encoded.
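To make the effect of the second argument concrete, here is a small sketch that compares three calls on one of the URLs from the traceback. It is written for Python 3's urllib.parse.quote (an assumption, chosen so the snippet runs on a current interpreter); the safe parameter plays the same role there as in Python 2's urllib.quote, and the variable names are only illustrative.

```python
from urllib.parse import quote

# One of the failing URLs from the traceback, with the non-ASCII "é" in the path.
url = "http://www.kiabi.es/Petit-B\u00e9guin_s2"

default_safe = quote(url)           # safe defaults to "/", so ":" is escaped
nothing_safe = quote(url, safe="")  # both ":" and "/" are escaped
keep_both = quote(url, safe=":/")   # only the non-ASCII character is escaped

print(default_safe)
print(nothing_safe)
print(keep_both)
```

Only the last form is still usable as a URL, which is why ':/' is passed in the answer.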

(In Python 3, the quote function has moved to the urllib.parse module.)
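For a crawler that also meets query strings, fragments, or links that are already percent-encoded, one possible variation is to split the URL first and quote each component separately. The following is a minimal Python 3 sketch, not part of the original answer: the helper name to_ascii_url and the chosen safe characters are assumptions, and non-ASCII host names (which would need IDNA encoding) are deliberately left alone.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url):
    """Percent-encode non-ASCII characters in the path, query and fragment.

    Hypothetical helper for illustration; the host name is passed through as-is.
    """
    parts = urlsplit(url)
    # "%" is kept safe so sequences that are already encoded are not encoded twice.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))


print(to_ascii_url("http://www.kiabi.es/Barbapap\u00e1_s1"))
```

In the crawler from the question, such a helper could be applied to each URL just before it is opened, so that the list of visited URLs and the request use the same encoded form.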

This post on python - UnicodeError: URL contains non-ASCII characters (Python 2.7) is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/33708059/
