
A Python script for scraping HTML output sometimes works, sometimes doesn't

Reposted · Author: 行者123 · Updated: 2023-12-01 05:16:56

I'm trying to scrape the links from Yahoo's search results with the Python code below. I use mechanize as the browser instance and Beautiful Soup to parse the HTML.

The problem is that this script sometimes works fine and sometimes throws the following error:

WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.

Obviously, my guess is that it's related to encoding/decoding or gzip compression, but why does it work sometimes and not other times? And how can I make it work every time?
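For context, that warning appears whenever bytes can't be decoded under the codec being assumed, and each undecodable byte is substituted with U+FFFD. A minimal illustration of the mechanism (not from the original post; Python 3 syntax):

```python
# -*- coding: utf-8 -*-
# Bytes encoded as UTF-8...
raw = u"Märkte".encode("utf-8")
# ...but decoded with the wrong codec: the non-ASCII bytes
# become U+FFFD, the REPLACEMENT CHARACTER from the warning.
text = raw.decode("ascii", errors="replace")
assert u"\ufffd" in text
```

This is why the script fails only sometimes: it depends on whether the particular response happens to contain bytes outside the codec being assumed.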

Here is the code. Run it 7-8 times and you'll notice.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import mechanize
import urllib
from bs4 import BeautifulSoup
import re

#mechanize emulates a Browser
br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent','chrome')]

term = "stock market".replace(" ","+")
query = "https://search.yahoo.com/search?q=" + term

htmltext = br.open(query).read()
htm = str(htmltext)

soup = BeautifulSoup(htm)
#Since all results are located in the ol tag
search = soup.findAll('ol')

searchtext = str(search)

#Using BeautifulSoup to parse the HTML source
soup1 = BeautifulSoup(searchtext)
#Each search result is contained within div tag
list_items = soup1.findAll('div', attrs={'class':'res'})


#List of first search result
list_item = str(list_items)

for li in list_items:
    list_item = str(li)
    soup2 = BeautifulSoup(list_item)
    link = soup2.findAll('a')
    print link[0].get('href')
    print ""

这是输出屏幕截图: http://pokit.org/get/img/1d47e0d0dc08342cce89bc32ae6b8e3c.jpg

Best answer

I ran into encoding problems on a project and developed a function to get the encoding of the page I was scraping - then you can decode to unicode for your function, to try to prevent these errors. Re: compression, what you need to do is develop your code so that if it encounters a compressed file it can handle it.

from bs4 import BeautifulSoup, UnicodeDammit
import chardet
import re

def get_encoding(soup):
    """
    This is a method to find the encoding of a document.
    It takes in a Beautiful Soup object and retrieves the values of that
    document's meta tags. It checks for a meta charset first; if that exists,
    it returns it as the encoding. If charset doesn't exist, it checks for
    content-type and then content to try and find it.
    """
    encod = soup.meta.get('charset')
    if encod == None:
        encod = soup.meta.get('content-type')
        if encod == None:
            content = soup.meta.get('content')
            match = re.search('charset=(.*)', content)
            if match:
                encod = match.group(1)
            else:
                # note: chardet works on the raw bytes of the page,
                # so detection on an already-decoded soup is a best guess
                dic_of_possible_encodings = chardet.detect(unicode(soup))
                encod = dic_of_possible_encodings['encoding']
    return encod
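If you'd rather avoid the chardet dependency, the same meta-tag check can be done on the raw response bytes with only the standard library. A small sketch (the helper name `sniff_charset` is mine, not from the answer):

```python
import re

def sniff_charset(head_bytes):
    """Return the charset declared in raw HTML bytes, or None.

    Stdlib-only fallback: looks for a charset=... declaration
    (either <meta charset="..."> or a Content-Type meta tag)
    in the first chunk of the response.
    """
    m = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', head_bytes)
    return m.group(1).decode("ascii") if m else None
```

For example, `sniff_charset(b'<meta charset="utf-8">')` returns `'utf-8'`, and `None` is returned when no declaration is present, in which case you'd fall back to detection or assume UTF-8.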

A link on handling compressed data: http://www.diveintopython.net/http_web_services/gzip_compression.html
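The linked chapter covers this in depth; the core idea can be sketched in a few lines. Gzip streams always begin with the magic bytes 0x1f 0x8b, so you can check for them before decompressing (Python 3 syntax; the helper name is my own):

```python
import gzip

def maybe_decompress(body):
    """Decompress an HTTP response body if it is gzip-compressed."""
    # every gzip stream starts with the two magic bytes 0x1f 0x8b
    if body[:2] == b"\x1f\x8b":
        return gzip.decompress(body)
    return body
```

This way the scraping code receives plain HTML bytes whether or not the server compressed the response.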

From this question: Check if GZIP file exists in Python

import os

# any() takes an iterable of booleans, so map isfile over the names
if any(os.path.isfile(f) for f in ['bob.asc', 'bob.asc.gz']):
    print 'yay'

Regarding "Python script for scraping HTML output sometimes works, sometimes doesn't", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/22997549/
