
python - xml.etree.ElementTree.ParseError: not well-formed (invalid token) error

Reposted. Author: 太空宇宙. Updated: 2023-11-04 02:23:21

Using Python 3.

The error we get:

  File "C:/scratch.py", line 27, in run
    tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
  File "C:\Programs\Python\Python36-32\lib\xml\etree\ElementTree.py", line 1314, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 163, column 1106

Our code:

import xml.etree.ElementTree as ET

# responses[0] holds the raw bytes of the fetched feed
tree = ET.fromstring(responses[0].decode(), ET.XMLParser(encoding='utf-8'))
for i in tree.iter('item'):
    try:
        title = i.find('title').text
    except Exception:
        pass

responses[0] comes from a list of responses returned by URL GET requests; index 0 here is the test case for one specific URL: http://feeds.feedburner.com/marginalrevolution/feed

We pasted the XML into the W3Schools validator and got:

This page contains the following errors:
error on line 163 at column 31: Input is not in proper UTF-8, indicate encoding! Bytes: 0x0C 0x66 0x69 0x67

But with ET.XMLParser(encoding='utf-8') specified, shouldn't that fix the error at parse time?

Best answer

The W3Schools validator's error message is misleading. The problem with 0x0c is not that it is invalid UTF-8, but that it is not a legal character in XML.
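
To see where such characters sit in a document, a minimal sketch along the following lines may help. The helper name find_illegal_xml_chars is hypothetical and not part of the original answer. The pattern covers the C0 control characters that the XML 1.0 Char production excludes, plus U+FFFE and U+FFFF; surrogates are left out on the assumption that the text came from a strict UTF-8 decode, which cannot produce them.

```python
import re

# Characters that XML 1.0 does not allow: C0 controls other than
# tab (0x09), LF (0x0A) and CR (0x0D), plus U+FFFE and U+FFFF.
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def find_illegal_xml_chars(text):
    """Return (line, column, codepoint) for each disallowed character.

    Lines are 1-based and columns are 0-based offsets within the line.
    The text is split on "\n" only, because str.splitlines() would also
    break lines at form feed (0x0c), the very character being hunted.
    """
    hits = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in ILLEGAL_XML_CHARS.finditer(line):
            hits.append((lineno, match.start(), hex(ord(match.group()))))
    return hits
```

Calling find_illegal_xml_chars(responses[0].decode()) on the feed from the question would be one way to list each offending position before deciding how to clean the input.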

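0x0c is the form feed control character, so its presence in the document serves no purpose. A conforming XML parser is obliged to reject documents that are not well-formed, and you cannot change the RSS feed, so the simplest solution is to remove the character from the document before processing it.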

>>> tree = ET.fromstring(original_response, ET.XMLParser(encoding='utf-8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.7/xml/etree/ElementTree.py", line 1315, in XML
    parser.feed(text)
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 185, column 1106

>>> fixed = original_response.replace(b'\x0c', b'')
>>> tree = ET.fromstring(fixed, ET.XMLParser(encoding='utf-8'))
>>> tree
<Element 'rss' at 0x7ff316db6278>
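
If a feed might contain other stray control characters besides form feed, the same idea can be generalized. The following is only a sketch under stated assumptions: parse_lenient and CONTROL_BYTES are hypothetical names, and the input is assumed to be bytes in UTF-8 or another ASCII-compatible encoding (it would not be safe for UTF-16).

```python
import re
import xml.etree.ElementTree as ET

# C0 control bytes other than tab (0x09), LF (0x0A) and CR (0x0D).
# In UTF-8 these byte values never appear inside a multi-byte sequence,
# so removing them cannot corrupt neighbouring characters.
CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_lenient(raw):
    """Strip control bytes that XML 1.0 forbids, then parse the result."""
    cleaned = CONTROL_BYTES.sub(b"", raw)
    return ET.fromstring(cleaned)
```

With the response from the question, tree = parse_lenient(original_response) would stand in for the replace-then-parse steps above. Silently dropping bytes changes the data, so this only makes sense when the removed characters carry no meaning, as with the form feed here.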

Regarding python - xml.etree.ElementTree.ParseError: not well-formed (invalid token) error, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51049975/
