
python - A good Python XML parser for namespace-heavy documents

Reposted. Author: 太空狗. Updated: 2023-10-29 20:46:31

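Python's ElementTree does not seem to work well with namespaces. What are my options? BeautifulSoup's namespace handling is also poor. I do not want to strip the namespaces.

Examples of how a specific Python library retrieves namespaced elements, and collections of them, all earn a +1.

Edit: Could you provide code, using the library of your choice, that handles this real-world use case?

How would you get the strings 'Line Break' and '2.6', and the list ['PYTHON', 'XML', 'XML-NAMESPACES']?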

<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
xmlns="http://unilexicon.com/vocabularies/"
xmlns:zs="http://www.loc.gov/zing/srw/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
<zs:records>
<zs:record>
<zs:recordData>
<srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
<name>Line Break</name>
<dc:title>Processing XML namespaces using Python</dc:title>
<dc:description>How to get contents string from an element,
how to get a collection in a list...</dc:description>
<lom:metaMetadata>
<lom:identifier>
<lom:catalog>Python</lom:catalog>
<lom:entry>2.6</lom:entry>
</lom:identifier>
</lom:metaMetadata>
<lom:classification>
<lom:taxonPath>
<lom:taxon>
<lom:id>PYTHON</lom:id>
</lom:taxon>
</lom:taxonPath>
</lom:classification>
<lom:classification>
<lom:taxonPath>
<lom:taxon>
<lom:id>XML</lom:id>
</lom:taxon>
</lom:taxonPath>
</lom:classification>
<lom:classification>
<lom:taxonPath>
<lom:taxon>
<lom:id>XML-NAMESPACES</lom:id>
</lom:taxon>
</lom:taxonPath>
</lom:classification>
</srw_dc:dc>
</zs:recordData>
</zs:record>
<!-- ... more records ... -->
</zs:records>
</zs:searchRetrieveResponse>

Best answer

lxml is namespace-aware.

>>> from lxml import etree
>>> et = etree.XML("""<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>""")
>>> etree.tostring(et, encoding=str) # encoding=str only needed in Python 3, to avoid getting bytes
'<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz/></bar></root>'
>>> et.xpath("f:bar", namespaces={"b":"bar", "f": "foo"})
[<Element {foo}bar at ...>]
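
A minimal sketch of what "namespace-aware" means in practice: each element's tag is stored as {namespace-URI}localname, and the prefixes used in a query are local aliases that do not have to match the prefixes written in the document. The alias x below is arbitrary. The ImportError fallback to the standard library's xml.etree.ElementTree is an assumption added so the snippet also runs without lxml; only calls that both libraries share are used.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

root = etree.fromstring(
    '<root xmlns="foo" xmlns:stuff="bar"><bar><stuff:baz /></bar></root>'
)

# Tags are stored as "{namespace-URI}localname".
print(root.tag)  # {foo}root

# The same notation can be used directly in a path, without any prefix map.
bar = root.find("{foo}bar")

# Or map an arbitrary alias to the URI; "x" need not match "stuff".
baz = bar.find("x:baz", namespaces={"x": "bar"})
print(baz.tag)  # {bar}baz
```

Because matching is done on the URI, a query keeps working if the document author renames a prefix.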

Edit: applied to your example:

from lxml import etree

# The b prefix is needed in Python 3, where lxml rejects str input that
# carries an encoding declaration:
# "Unicode strings with encoding declaration are not supported."
# In Python 2, remove the b prefix.
et = etree.XML(b"""...""")

# 'voc' is an alias chosen here for the document's default namespace
ns = {
    'lom': 'http://ltsc.ieee.org/xsd/LOM',
    'zs': 'http://www.loc.gov/zing/srw/',
    'dc': 'http://purl.org/dc/elements/1.1/',
    'voc': 'http://unilexicon.com/vocabularies/',
    'srw_dc': 'info:srw/schema/1/dc-schema'
}

# According to the docs, .xpath() always returns a list when querying for elements.
# .find() returns a single element, but supports only a subset of XPath.
record = et.xpath("zs:records/zs:record", namespaces=ns)[0]
# This example contains only one record; otherwise, apply the
# following to every element returned by the query above.

# ".//" searches below this record; a bare "//" would search the whole document
name = record.xpath(".//voc:name", namespaces=ns)[0].text
print("name:", name)

lom_entry = record.xpath("zs:recordData/srw_dc:dc/"
                         "lom:metaMetadata/lom:identifier/"
                         "lom:entry",
                         namespaces=ns)[0].text

print('lom_entry:', lom_entry)

# "el" instead of "id" avoids shadowing the built-in id()
lom_ids = [el.text for el in
           record.xpath("zs:recordData/srw_dc:dc/"
                        "lom:classification/lom:taxonPath/"
                        "lom:taxon/lom:id",
                        namespaces=ns)]

print("lom_ids:", lom_ids)

Output:

name: Line Break
lom_entry: 2.6
lom_ids: ['PYTHON', 'XML', 'XML-NAMESPACES']
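
The comments above suggest applying the same queries to every record rather than only the first. One way to do that is sketched below, using .find(), .findall(), and .findtext() with a prefix map instead of .xpath(). The function name parse_records and the result dictionary keys are hypothetical, the XML is an abbreviated copy of the question's document, and the fallback to xml.etree.ElementTree is an assumption so the sketch runs without lxml.

```python
try:
    from lxml import etree
except ImportError:  # assumption: fall back to the standard library
    import xml.etree.ElementTree as etree

# Abbreviated copy of the question's document (dc:title and dc:description omitted).
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<zs:searchRetrieveResponse
    xmlns="http://unilexicon.com/vocabularies/"
    xmlns:zs="http://www.loc.gov/zing/srw/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:lom="http://ltsc.ieee.org/xsd/LOM">
  <zs:records>
    <zs:record>
      <zs:recordData>
        <srw_dc:dc xmlns:srw_dc="info:srw/schema/1/dc-schema">
          <name>Line Break</name>
          <lom:metaMetadata>
            <lom:identifier>
              <lom:catalog>Python</lom:catalog>
              <lom:entry>2.6</lom:entry>
            </lom:identifier>
          </lom:metaMetadata>
          <lom:classification><lom:taxonPath><lom:taxon>
            <lom:id>PYTHON</lom:id>
          </lom:taxon></lom:taxonPath></lom:classification>
          <lom:classification><lom:taxonPath><lom:taxon>
            <lom:id>XML</lom:id>
          </lom:taxon></lom:taxonPath></lom:classification>
          <lom:classification><lom:taxonPath><lom:taxon>
            <lom:id>XML-NAMESPACES</lom:id>
          </lom:taxon></lom:taxonPath></lom:classification>
        </srw_dc:dc>
      </zs:recordData>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>"""

# Prefix map; "voc" is an alias chosen here for the default namespace.
NS = {
    "lom": "http://ltsc.ieee.org/xsd/LOM",
    "zs": "http://www.loc.gov/zing/srw/",
    "voc": "http://unilexicon.com/vocabularies/",
    "srw_dc": "info:srw/schema/1/dc-schema",
}


def parse_records(xml_bytes):
    """Return one dict per zs:record (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)
    results = []
    for record in root.findall("zs:records/zs:record", namespaces=NS):
        dc = record.find("zs:recordData/srw_dc:dc", namespaces=NS)
        if dc is None:
            continue  # skip records without the expected payload
        results.append({
            "name": dc.findtext("voc:name", namespaces=NS),
            "lom_entry": dc.findtext(
                "lom:metaMetadata/lom:identifier/lom:entry", namespaces=NS),
            "lom_ids": [el.text for el in dc.findall(
                "lom:classification/lom:taxonPath/lom:taxon/lom:id",
                namespaces=NS)],
        })
    return results


records = parse_records(XML)
print(records)
```

Unlike indexing an .xpath() result with [0], .findtext() returns None when an element is missing, so a record lacking a name or entry does not raise IndexError.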

Regarding "python - A good Python XML parser for namespace-heavy documents", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/3785629/
