
python - Preserving the original doctype and declaration of an lxml.etree parsed xml

Reposted. Author: 太空狗. Updated: 2023-10-29 17:05:56

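I'm using Python's lxml and I'm trying to read an XML document, modify it, and write it back, but the original doctype and XML declaration disappear. Is there an easy way to put them back, either through lxml or some other solution?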

Best answer

tl;dr

# adds declaration with version and encoding regardless of
# which attributes were present in the original declaration
# expects utf-8 encoding (encode/decode calls)
# depending on your needs you might want to improve that
from lxml import etree
from xml.dom.minidom import parseString

xml1 = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE root SYSTEM "example.dtd">
<root>...</root>
'''
xml2 = '''\
<root>...</root>
'''

def has_xml_declaration(xml):
    # minidom sets .version only when the source has an XML declaration
    return parseString(xml).version

def process(xml):
    t = etree.fromstring(xml.encode()).getroottree()
    if has_xml_declaration(xml):
        print(etree.tostring(t, xml_declaration=True, encoding=t.docinfo.encoding).decode())
    else:
        print(etree.tostring(t).decode())

process(xml1)
process(xml2)
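
The comments in the tl;dr note that it expects UTF-8 and could be improved. One possible direction, sketched here under assumptions, is to stay in bytes the whole way so that no encode/decode guess is needed. The helper names `has_xml_declaration` (reimplemented without minidom) and `serialize_like_source` are illustrative, not part of lxml, and the prefix check assumes an ASCII-compatible encoding (it would not detect a declaration in a UTF-16 document).

```python
from lxml import etree

UTF8_BOM = b"\xef\xbb\xbf"
# "<?xml" followed by whitespace, so "<?xml-stylesheet ...?>" is not mistaken for it
DECLARATION_PREFIXES = (b"<?xml ", b"<?xml\t", b"<?xml\r", b"<?xml\n")


def has_xml_declaration(xml_bytes):
    """Return True if the raw document starts with an XML declaration.

    Assumption: the document uses an ASCII-compatible encoding.
    """
    if xml_bytes.startswith(UTF8_BOM):
        xml_bytes = xml_bytes[len(UTF8_BOM):]
    return xml_bytes.startswith(DECLARATION_PREFIXES)


def serialize_like_source(xml_bytes):
    """Parse bytes and serialize them again, keeping a declaration only if one was present."""
    tree = etree.fromstring(xml_bytes).getroottree()
    if has_xml_declaration(xml_bytes):
        return etree.tostring(tree, xml_declaration=True,
                              encoding=tree.docinfo.encoding)
    return etree.tostring(tree)


latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?>\n<root>caf\xe9</root>\n'
bare_doc = b'<root>...</root>\n'

with_decl = serialize_like_source(latin1_doc)
without_decl = serialize_like_source(bare_doc)
print(with_decl.decode("iso-8859-1"))
print(without_decl.decode("ascii"))
```

Because the result is returned as bytes in the document's own encoding, it can be written straight to a file opened in binary mode.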

The following will include the DOCTYPE and the XML declaration:

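import io
from lxml import etree

# parse from bytes so that the encoding named in the declaration is honoured
tree = etree.parse(io.BytesIO(b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''))

docinfo = tree.docinfo
# tostring returns bytes here; decode them with the document's own encoding to print
print(etree.tostring(tree, xml_declaration=True, encoding=docinfo.encoding).decode(docinfo.encoding))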
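
The question asks for a full read, modify, write-back cycle. A minimal Python 3 sketch of that cycle, built on the same `tostring` call, might look as follows. The function name `round_trip` and the `edited` attribute are hypothetical and stand in for whatever modification is actually needed.

```python
import io

from lxml import etree

SOURCE = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''


def round_trip(xml_bytes):
    """Parse, modify, and re-serialize a document, keeping declaration and DOCTYPE."""
    # Keep working with the ElementTree (not just the root Element),
    # because the tree is what carries the docinfo and the DOCTYPE.
    tree = etree.parse(io.BytesIO(xml_bytes))

    # Hypothetical modification: mark the root element as edited.
    tree.getroot().set("edited", "yes")

    # Serialize the whole tree, reusing the encoding of the original document.
    return etree.tostring(tree, xml_declaration=True,
                          encoding=tree.docinfo.encoding)


result = round_trip(SOURCE)
print(result.decode("iso-8859-1"))
```

To write the result back to disk, the returned bytes can be written to a file opened with mode `'wb'`, which avoids a second encoding step.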

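Note that tostring does not preserve the DOCTYPE if you create an Element (e.g. using fromstring); it only works when you process the XML using parse.

Update: as J.F. Sebastian pointed out, my assertion about fromstring is not correct.

Here is some code to highlight the difference between Element and ElementTree serialization: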

import io
from lxml import etree

# bytes, because lxml rejects str input that carries an encoding declaration
xml_bytes = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test" [ <!ENTITY tasty "eggs"> ]>
<root>
<a>&tasty;</a>
</root>
'''

# get the ElementTree using parse
parse_tree = etree.parse(io.BytesIO(xml_bytes))
encoding = parse_tree.docinfo.encoding
result = etree.tostring(parse_tree, xml_declaration=True, encoding=encoding)
print("%s\nparse ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# get the ElementTree using fromstring
fromstring_tree = etree.fromstring(xml_bytes).getroottree()
encoding = fromstring_tree.docinfo.encoding
result = etree.tostring(fromstring_tree, xml_declaration=True, encoding=encoding)
print("%s\nfromstring ElementTree:\n%s\n" % ('-'*20, result.decode(encoding)))

# DOCTYPE is lost, and no access to encoding
fromstring_element = etree.fromstring(xml_bytes)
result = etree.tostring(fromstring_element, xml_declaration=True)
print("%s\nfromstring Element:\n%s\n" % ('-'*20, result.decode()))

The output is:

--------------------
parse ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>

--------------------
fromstring ElementTree:
<?xml version='1.0' encoding='iso-8859-1'?>
<!DOCTYPE root SYSTEM "test" [
<!ENTITY tasty "eggs">
]>
<root>
<a>eggs</a>
</root>

--------------------
fromstring Element:
<?xml version='1.0' encoding='ASCII'?>
<root>
<a>eggs</a>
</root>
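
When only an Element is available, one possible workaround is the `doctype` keyword argument of `etree.tostring`, which writes a given string ahead of the serialized element. The sketch below takes that string and the encoding from the `docinfo` of the element's root tree. The sample document and variable names are illustrative.

```python
from lxml import etree

XML_BYTES = b'''<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE root SYSTEM "test">
<root>
<a>eggs</a>
</root>
'''

root = etree.fromstring(XML_BYTES)

# The Element still knows its document; docinfo lives on the root tree.
docinfo = root.getroottree().docinfo

# Serialize the Element, passing the declaration settings and DOCTYPE explicitly.
serialized = etree.tostring(
    root,
    xml_declaration=True,
    encoding=docinfo.encoding,
    doctype=docinfo.doctype,
)
print(serialized.decode(docinfo.encoding))
```

Note that `docinfo.doctype` holds only the root name and the public/system identifiers, so an internal subset such as the `tasty` entity declaration would not be reproduced this way. Serializing the ElementTree, as in the first two cases above, remains the simpler route.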

Regarding python - Preserving the original doctype and declaration of an lxml.etree parsed xml, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12966488/
