
python - What to do when a UTF-8 encoded HTML file contains non-UTF-8 characters?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:36:59

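I am trying to use BeautifulSoup to parse an HTML file encoded in UTF-8. Unfortunately, the file contains some non-UTF-8 characters, so it does not display correctly. That is fine with me, because I can simply skip those characters.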

The problem is that even when I explicitly set fromEncoding to utf-8:

soup = BeautifulSoup(html, fromEncoding='utf-8')

soup.originalEncoding still ends up set to windows-1252, the default fallback.

print soup.originalEncoding
windows-1252

I consulted the BeautifulSoup documentation, which says:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

- An encoding you pass in as the fromEncoding argument to the soup
constructor.
- An encoding discovered in the document itself
- An encoding sniffed by looking at the first few bytes of the file. If
an encoding is detected at this stage, it will be one of the UTF-*
encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.
- UTF-8
- Windows-1252

So it seems it should use the fromEncoding I specified rather than falling through to the last entry in the list.
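One plausible reading of this behavior, assuming each candidate encoding is tried with strict decoding and skipped when it fails, is that the invalid bytes make the utf-8 attempt fail, so the search continues down the list. The sketch below illustrates that fallback idea in Python 3 with the standard library only. It is not Beautiful Soup's actual implementation; the function name `first_working_encoding` and the sample bytes are made up for illustration.

```python
def first_working_encoding(data, candidates):
    """Return (text, encoding) for the first candidate that decodes `data` strictly."""
    for encoding in candidates:
        try:
            # Strict decoding raises on any invalid byte sequence.
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("no candidate encoding could decode the data")


# Hypothetical sample: mostly UTF-8, plus two stray bytes (0x93, 0x94)
# that are invalid in UTF-8 but are curly quotes in windows-1252.
raw_html = b"<p>caf\xc3\xa9 \x93quoted\x94</p>"

text, used = first_working_encoding(raw_html, ["utf-8", "windows-1252"])
print(used)
print(text)
```

With these sample bytes the utf-8 attempt is rejected and windows-1252 is chosen, which also means the genuinely UTF-8 encoded character is misread as two separate characters.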

Here is the original html I'm parsing, for reference.

Best answer

If you know how the file is encoded, try decoding the string yourself before passing it to BeautifulSoup, explicitly ignoring the non-UTF-8 characters.

# Decode the raw bytes yourself, dropping any byte sequences that are not valid UTF-8
unicode_html = myfile.read().decode('utf-8', 'ignore')
soup = BeautifulSoup(unicode_html)
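The snippet above is Python 2 style. As a minimal sketch of the same decoding step on Python 3, using only the standard library, the example below compares the 'ignore' and 'replace' error handlers of bytes.decode. The helper name `decode_utf8_lenient` and the sample bytes are assumptions made for illustration; the resulting string is what you would then hand to the parser.

```python
def decode_utf8_lenient(data: bytes, errors: str = "ignore") -> str:
    """Decode bytes as UTF-8, handling invalid sequences according to `errors`."""
    return data.decode("utf-8", errors)


# Hypothetical sample: valid UTF-8 text with one stray byte (0x93)
# that is not a valid UTF-8 sequence on its own.
raw_html = b"<p>caf\xc3\xa9 \x93quoted</p>"

dropped = decode_utf8_lenient(raw_html)             # invalid byte is silently skipped
marked = decode_utf8_lenient(raw_html, "replace")   # invalid byte becomes U+FFFD

print(dropped)
print(marked)
```

Using 'replace' instead of 'ignore' keeps a visible marker where data was lost, which can help when checking how much of the file was damaged.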

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/8912980/
