python - Opening pages with urllib2 - diacritics

Reposted. Author: 行者123. Updated: 2023-11-28 01:41:28

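I'm trying to open several pages using urllib2. The problem is that some of the pages can't be opened. The call raises urllib2.HTTPError: HTTP Error 400: Bad Request

I get the hrefs for these pages from another web page (its head declares charset = "utf-8"). The error is raised only when I try to open a page whose URL contains 'č', 'ž' or 'ř'.

Here is the code: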

    import urllib2
    from bs4 import BeautifulSoup

    def getSoup(url):
        req = urllib2.Request(url)
        response = urllib2.urlopen(req)
        page = response.read()
        soup = BeautifulSoup(page, 'html.parser')
        return soup

    hovienko = getSoup("http://www.hovno.cz/hovna-az/a/1/")
    lis = hovienko.find("div", class_="span12").find('ul').findAll('li')

    for liTag in lis:
        aTag = liTag.find('a')['href']
        # hrefs that I'm trying to open using urllib2
        href = "http://www.hovno.cz" + aTag
        # the error occurs here when the URL contains 'č', 'ž' or 'ř'
        soup = getSoup(href.encode("iso-8859-2"))

Does anyone know what I have to do to avoid the error?

Thanks

Best answer

This site is UTF-8. Why do you need href.encode("iso-8859-2")? I took the following code from http://programming-review.com/beautifulsoasome-interesting-python-functions/

    # Python 2, using the old BeautifulSoup 3 package
    import urllib2
    import cgitb
    cgitb.enable()
    from BeautifulSoup import BeautifulSoup
    from urlparse import urlparse

    # print all links
    def PrintLinks(localurl):
        data = urllib2.urlopen(localurl).read()
        print 'Type of fetched HTML : %s' % type(data)
        soup = BeautifulSoup(data)
        parse = urlparse(localurl)
        # keep only scheme and host, e.g. "http://www.zlatestranky.cz"
        localurl = parse[0] + "://" + parse[1]
        print "<h3>Page links statistics</h3>"
        l = soup.findAll("a", attrs={"href": True})
        print "<h4>Total links count = " + str(len(l)) + '</h4>'
        externallinks = []  # external links list
        for link in l:
            # if it's an external link
            if link['href'].find("http://") == 0 and link['href'].find(localurl) == -1:
                externallinks = externallinks + [link]
        print "<h4>External links count = " + str(len(externallinks)) + '</h4>'

        if len(externallinks) > 0:
            print "<h3>External links list:</h3>"
            for link in externallinks:
                if link.text != '':
                    print '<h5>' + link.text.encode('utf-8')
                    print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'
                else:
                    print '<h5>' + '[image]',
                    print ' => [' + '<a href="' + link['href'] + '" >' + link['href'] + '</a>' + ']' + '</h5>'

    PrintLinks("http://www.zlatestranky.cz/pro-mobily/")
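
To make the point about UTF-8 concrete, here is a minimal sketch of one way the hrefs from the question could be turned into plain ASCII URLs before they are opened. It assumes the server expects non-ASCII path characters as percent-encoded UTF-8 bytes, which is common for UTF-8 sites but is not confirmed for this one. The helper name to_ascii_url and the example URL are made up for illustration, and the code is written to run on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
try:
    # Python 2
    from urllib import quote
    from urlparse import urlsplit, urlunsplit
except ImportError:
    # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit


def to_ascii_url(url, encoding="utf-8"):
    """Percent-encode non-ASCII characters in the path and query of `url`.

    Hypothetical helper: `url` is assumed to be unicode text, and the host
    name is assumed to be plain ASCII (no IDN handling).
    """
    parts = urlsplit(url)
    # "%" stays in the safe set so already-escaped sequences are not escaped twice.
    path = quote(parts.path.encode(encoding), safe="/%")
    query = quote(parts.query.encode(encoding), safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Made-up URL containing 'č' (U+010D), which is the bytes C4 8D in UTF-8.
print(to_ascii_url(u"http://www.example.com/hovna-az/\u010d/1/"))
# http://www.example.com/hovna-az/%C4%8D/1/
```

With a helper like this, the loop in the question could call getSoup(to_ascii_url(href)) instead of getSoup(href.encode("iso-8859-2")). Passing encoding="iso-8859-2" would produce %E8 for 'č' instead of %C4%8D, which shows why the choice of encoding changes the URL the server actually receives.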

Regarding "python - Opening pages with urllib2 - diacritics", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25746668/
