python - HTML encoding and lxml parsing

Reposted. Author: 太空狗. Updated: 2023-10-29 19:36:02

I am trying to finally resolve some encoding issues that come up when scraping HTML with lxml. Here are three sample HTML documents that I have encountered:

1.

<!DOCTYPE html>
<html lang='en'>
<head>
<title>Unicode Chars: 은 —’</title>
<meta charset='utf-8'>
</head>
<body></body>
</html>

2.

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="ko-KR" lang="ko-KR">
<head>
<title>Unicode Chars: 은 —’</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
</head>
<body></body>
</html>

3.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Unicode Chars: 은 —’</title>
</head>
<body></body>
</html>

My basic script:

from lxml.html import fromstring
...

doc = fromstring(raw_html)
title = doc.xpath('//title/text()')[0]
print(title)

The results are:

Unicode Chars: ì ââ
Unicode Chars: 은 —’
Unicode Chars: 은 —’

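So there is obviously a problem with sample 1 and its missing <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> tag. A solution suggested elsewhere correctly identifies sample 1 as utf-8, so it is functionally equivalent to my original code.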
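The garbled first result is consistent with the document's UTF-8 bytes being decoded as a single-byte encoding such as ISO-8859-1. The following is a minimal sketch using only the standard library, not lxml itself; the variable names are illustrative, and the choice of ISO-8859-1 as the fallback is an assumption inferred from the output above.

```python
# Reproduce the mojibake from sample 1 without lxml.
title = 'Unicode Chars: 은 —’'

# The bytes as they would appear in a UTF-8 encoded file.
utf8_bytes = title.encode('utf-8')

# Assumption: the parser falls back to a single-byte encoding
# (ISO-8859-1 here) because it does not see a charset it recognizes.
garbled = utf8_bytes.decode('iso-8859-1')

# The damage is reversible as long as no bytes were dropped.
repaired = garbled.encode('iso-8859-1').decode('utf-8')

print(repr(garbled))
print(repaired == title)
```

Each non-ASCII character turns into two or three characters, most of them invisible control characters, which matches the "ì ââ" seen in the first result.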

The lxml documentation appears to contradict itself:

One example in the documentation seems to suggest that we should use UnicodeDammit to decode the markup to unicode.

from BeautifulSoup import UnicodeDammit

def decode_html(html_string):
    converted = UnicodeDammit(html_string, isHTML=True)
    if not converted.unicode:
        raise UnicodeDecodeError(
            "Failed to detect encoding, tried [%s]",
            ', '.join(converted.triedEncodings))
    # print converted.originalEncoding
    return converted.unicode

root = lxml.html.fromstring(decode_html(tag_soup))

But elsewhere the documentation says:

[Y]ou will get errors when you try [to parse] HTML data in a unicode string that specifies a charset in a meta tag of the header. You should generally avoid converting XML/HTML data to unicode before passing it into the parsers. It is both slower and error prone.

If I try to follow the first suggestion from the lxml documentation, my code becomes:

from lxml.html import fromstring
from bs4 import UnicodeDammit
...
dammit = UnicodeDammit(raw_html)
doc = fromstring(dammit.unicode_markup)
title = doc.xpath('//title/text()')[0]
print(title)

I now get the following results:

Unicode Chars: 은 —’
Unicode Chars: 은 —’
ValueError: Unicode strings with encoding declaration are not supported.

Sample 1 now works correctly, but sample 3 raises an error because of the <?xml version="1.0" encoding="utf-8"?> declaration.
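Since the error message points at the encoding declaration, one possible workaround is to drop the XML declaration from the already-decoded text before handing it to the parser. This is only a sketch under that assumption: strip_xml_declaration is a hypothetical helper, not part of lxml or BeautifulSoup, and the regular expression handles only a declaration at the very start of the text.

```python
import re

# Matches an XML declaration such as <?xml version="1.0" encoding="utf-8"?>
# at the start of the text, plus any whitespace that follows it.
_XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Return `text` without a leading XML declaration (hypothetical helper)."""
    return _XML_DECLARATION.sub('', text, count=1)


sample = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<!DOCTYPE html>\n'
    '<html><head><title>Unicode Chars: 은 —’</title></head></html>'
)
cleaned = strip_xml_declaration(sample)
print(cleaned.splitlines()[0])
```

The cleaned string could then be passed to fromstring in place of dammit.unicode_markup. Text that has no declaration is returned unchanged.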

Is there a correct way to handle all of these cases? Is there a better solution than the following?

dammit = UnicodeDammit(raw_html)
try:
    doc = fromstring(dammit.unicode_markup)
except ValueError:
    doc = fromstring(raw_html)

Best answer

lxml has several issues related to handling Unicode. For now, it may be best to pass bytes and specify the character encoding explicitly:

#!/usr/bin/env python
import glob
from lxml import html
from bs4 import UnicodeDammit

for filename in glob.glob('*.html'):
    with open(filename, 'rb') as file:
        content = file.read()
        doc = UnicodeDammit(content, is_html=True)

    parser = html.HTMLParser(encoding=doc.original_encoding)
    root = html.document_fromstring(content, parser=parser)
    title = root.find('.//title').text_content()
    print(title)

Output

Unicode Chars: 은 —’
Unicode Chars: 은 —’
Unicode Chars: 은 —’
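To illustrate the idea behind the answer (work out the encoding from the raw bytes, then tell the parser explicitly), here is a simplified standard-library sketch of sniffing a declared encoding. It is not UnicodeDammit's algorithm, which covers far more cases; sniff_declared_encoding and both patterns are hypothetical and only recognize the three declaration styles shown in the question.

```python
import re

# Simplified patterns, applied to raw bytes so no decoding is needed first.
XML_DECL = re.compile(
    br'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._:-]+)["\']')
META_CHARSET = re.compile(
    br'<meta[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.IGNORECASE)


def sniff_declared_encoding(raw, default=None):
    """Return the encoding name declared in the markup bytes, or `default`."""
    head = raw[:4096]  # assumption: declarations sit near the top of the file
    match = XML_DECL.search(head) or META_CHARSET.search(head)
    if match is None:
        return default
    return match.group(1).decode('ascii').lower()


sample_1 = ("<html><head><title>Unicode Chars: 은 —’</title>"
            "<meta charset='utf-8'></head></html>").encode('utf-8')
print(sniff_declared_encoding(sample_1))
```

The returned name would play the role of doc.original_encoding in the answer's code, passed on as html.HTMLParser(encoding=...). A document with no declaration yields the default, so a real detector still needs a fallback such as byte-level guessing.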

Regarding python - HTML encoding and lxml parsing, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15302125/
