
Converting XML illegal &char to utf8 - python

Reposted. Author: 太空狗. Updated: 2023-10-29 15:30:27

There is a list of XML and HTML character entity references at https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.

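However, some references are not defined in that list at all, even though they were used in old HTML scripts. While processing the Senseval-2 Format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, I ran into the following words, which break my script that tries to parse the data with xml.etree.ElementTree.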

What are the Unicode equivalents of these words?

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

It gives this traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113
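
The line and column in that message can also be read from the exception itself, which makes it easier to see the offending text. The following is a minimal sketch written for Python 3 (the traceback above is from Python 2.7); the helper name find_parse_error and the sample string are made up for illustration and are not part of the dataset.

```python
import xml.etree.ElementTree as et


def find_parse_error(xml_text):
    """Return (line, column, line_text) for the first parse error, or None."""
    try:
        et.fromstring(xml_text)
    except et.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return line, column, xml_text.splitlines()[line - 1]
    return None


# Hypothetical sample that mimics one of the malformed references.
sample = "<corpus>\n<s>salt &and. pepper</s>\n</corpus>"
result = find_parse_error(sample)
print(result)
```

For the real file, the same idea applies with the file contents read into a string first.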

Best answer

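The "words" look like malformed entity references. A valid entity reference ends with a semicolon. I looked at test-fix.xml (in Sval1to2.fix.tar.gz), and it seems very likely that &dash (or &dash.) stands for some kind of dash or hyphen. The file has an .xml extension, and it would be very close to well-formed XML if the bad entity references were fixed.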
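
If you want to repair the references before parsing, one option is a regex pass over the raw text. The sketch below is only an illustration: the mapping in GUESSED is my own guess at what each name means (the dataset does not document it), the function name fix_pseudo_entities is hypothetical, and the trailing "." is assumed to play the role of the missing semicolon. Names that are not in the mapping, such as dashq, are kept as literal text with the ampersand escaped.

```python
import re
import xml.etree.ElementTree as et

# Guessed replacements; these are assumptions, not documented by the dataset.
GUESSED = {
    "and": "&amp;",      # ampersand, escaped so the output stays well-formed
    "backquote": "`",
    "dash": "\u2014",    # em dash; an en dash or hyphen is equally plausible
    "degree": "\u00b0",
    "ellip": "\u2026",
    "times": "\u00d7",
}

# "&name" with an optional trailing "." that is NOT terminated by ";",
# so valid references such as "&amp;" or "&lt;" are left alone.
PSEUDO_ENTITY = re.compile(r"&([A-Za-z]+)\b(?!;)\.?")


def fix_pseudo_entities(text):
    def repl(match):
        name = match.group(1)
        if name in GUESSED:
            return GUESSED[name]
        # Unknown name: keep the text, but escape the bare ampersand.
        return "&amp;" + match.group(0)[1:]
    return PSEUDO_ENTITY.sub(repl, text)


sample = "<s>R&and.D costs rose 5&degree.C &dash. or so&ellip.2 &dashq. &amp; &lt;</s>"
fixed = fix_pseudo_entities(sample)
root = et.fromstring(fixed)
print(root.text)
```

This only deals with the malformed references. Other unescaped characters in the data could still cause parse errors.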

The page you linked to (http://www.d.umn.edu/~tpederse/data.html) says:

Please note that our converted data will not "parse" as true xml text. This is due to the fact that in the original sense-tagged text, characters that require special handling in xml are not escaped, and so forth. We are considering ways to make this data "true" xml, and would be most grateful for any feedback on how to best do this.

So although the document looks a lot like XML, it is not XML, and the people who published it are well aware of that.
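
Since the file is not true XML, it can help to take stock of what needs fixing before deciding on replacements. A rough sketch, assuming that every "&name" not followed by a semicolon is one of these pseudo-references (the function name and sample string are hypothetical):

```python
import re
from collections import Counter

# "&name" that is not terminated by ";" (valid references are skipped).
UNTERMINATED = re.compile(r"&([A-Za-z]+)\b(?!;)")


def count_unterminated_entities(text):
    """Count each entity-like name that lacks the closing semicolon."""
    return Counter(UNTERMINATED.findall(text))


# Hypothetical sample; for the real data, read the file into a string first.
sample = "a &and. b &dash c &ellip.4 d &and.A e &amp; f"
counts = count_unterminated_entities(sample)
print(counts.most_common())
```

Run over the whole file, the resulting names show which replacements have to be decided on.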

Regarding "Converting XML illegal &char to utf8 - python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/19030728/
