
python - Mechanize submit form character encoding problem

Repost. Author: 太空狗. Updated: 2023-10-30 00:15:26

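I am trying to scrape http://www.nscb.gov.ph/ggi/database.asp, specifically all the tables you get from selecting the municipalities/provinces. I am using Python with lxml.html and mechanize. My scraper works fine so far, but when submitting municipality [19] "Peñarrubia, Abra" I get HTTP Error 500: Internal Server Error. I suspect this is due to character encoding. My guess is that the eñe character (n with a tilde above) causes the problem. How can I fix this?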

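A working example of this part of my script is shown below. As I am just starting out with Python (and often use snippets I found on SO), any further comments are greatly appreciated.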

from BeautifulSoup import BeautifulSoup
import mechanize
import lxml.html
import csv



class PrettifyHandler(mechanize.BaseHandler):
    def http_response(self, request, response):
        if not hasattr(response, "seek"):
            response = mechanize.response_seek_wrapper(response)
        # only use BeautifulSoup if response is html
        if response.info().dict.has_key('content-type') and ('html' in response.info().dict['content-type']):
            soup = BeautifulSoup(response.get_data())
            response.set_data(soup.prettify())
        return response

site = "http://www.nscb.gov.ph/ggi/database.asp"

output_mun = csv.writer(open(r'output-municipalities.csv','wb'))
output_prov = csv.writer(open(r'output-provinces.csv','wb'))

br = mechanize.Browser()
br.add_handler(PrettifyHandler())


# gets municipality stats
response = br.open(site)
br.select_form(name="form2")
muns = br.find_control("strMunicipality2", type="select").items
# municipality #19 is not working, those before do
for pos, item in enumerate(muns[19:]):
    br.select_form(name="form2")
    br["strMunicipality2"] = [item.name]
    print pos, item.name
    response = br.submit(id="button2", type="submit")
    html = response.read()
    root = lxml.html.fromstring(html)
    table = root.xpath('//table')[1]
    data = [
        [td.text_content().strip() for td in row.findall("td")]
        for row in table.findall("tr")
    ]
    print data, "\n"
    for row in data[2:]:
        if row:
            row.append(item.name)
            output_mun.writerow([s.encode('utf8') if type(s) is unicode else s for s in row])
    response = br.open(site) #go back button not working

# provinces follow here

Many thanks!

Edit: to be specific, the error occurs on this line:

response = br.submit(id="button2", type="submit")

Best answer

OK, found it. The cause is BeautifulSoup: it converts the page to unicode, and prettify() returns UTF-8 by default. You should use:

response.set_data(soup.prettify(encoding='latin-1'))
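To see why the output encoding could matter here, the following minimal sketch mimics the round trip with plain `encode`/`decode` calls instead of BeautifulSoup itself. It assumes the server serves and expects Latin-1 bytes, which the fix above suggests but the thread does not state outright. The variable names are hypothetical, and the sketch is written for Python 3, whereas the script in the question is Python 2.

```python
from urllib.parse import quote

# Hypothetical stand-in for the option value that mechanize reads from the form.
option_value = "Peñarrubia, Abra"

# Assumption: the original page is Latin-1 (ISO-8859-1) encoded.
page_bytes = option_value.encode("latin-1")

# Parsing decodes the page bytes to unicode text ...
decoded = page_bytes.decode("latin-1")

# ... and re-serializing encodes it again. UTF-8 here models the default
# prettify() output; Latin-1 models prettify(encoding='latin-1').
default_output = decoded.encode("utf-8")
fixed_output = decoded.encode("latin-1")

# The two encodings give different bytes for the eñe character,
# so the percent-encoded form values differ as well.
print(repr(default_output))
print(repr(fixed_output))
print(quote(default_output))
print(quote(fixed_output))
```

Only the Latin-1 output reproduces the original page bytes. Under the assumption above, the UTF-8 variant would send a two-byte sequence for "ñ" where the server expects a single byte, which is one plausible explanation for the 500 response.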

Regarding "python - Mechanize submit form character encoding problem", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6610208/
