
python - XML parsing ignores text

Reposted. Author: 行者123. Updated: 2023-12-01 08:26:37

I ran into the following problem while trying to extract information from a set of XML files in Python. I'm not doing anything special, for example:

import xml.etree.ElementTree as ET

root = ET.parse(r'C:\Documents\XMLfolder\file.xml').getroot()
info = root.find('foo').find('bar').find('info').text

This works for most of the information I have, but one part of the XML has the following format:

<bar>
<info id="1"><label>1</label>SampleTextHere</info>
</bar>

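The code above returns None. I can find the info element and the label element, though; I just can't find the text. If I edit the file to remove <label>1</label>, the code above returns the text I need.

Is there something very basic that I'm missing that would let me access the text without modifying all the XML files to remove the label? (This is related.)

Thanks!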

Best answer

From [Python 3]: xml.etree.ElementTree.Element.text (emphasis is mine):

These attributes can be used to hold additional data associated with the element. Their values are usually strings but may be any application-specific object. If the element is created from an XML file, the text attribute holds either the text between the element’s start tag and its first child or end tag, or None, and the tail attribute holds either the text between the element’s end tag and the next tag, or None.

...

To collect the inner text of an element, see itertext(), for example "".join(element.itertext()).
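Applied to the fragment from the question, the quoted rule means the wanted string is stored as the tail of the label child rather than as the text of info. A minimal sketch, assuming the fragment is parsed from a string instead of one of the files:

```python
import xml.etree.ElementTree as ET

# The fragment from the question, parsed from a string instead of a file.
info = ET.fromstring('<info id="1"><label>1</label>SampleTextHere</info>')
label = info.find("label")

print(info.text)                 # None: nothing sits between <info ...> and <label>
print(label.text)                # 1
print(label.tail)                # SampleTextHere
print("".join(info.itertext()))  # 1SampleTextHere
```

So the text is not lost; it is attached to the child element that precedes it.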

I created 3 files based on your description:

  • file0.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <foo>
    <bar>
    <info id="1">SampleTextHere 0</info>
    </bar>
    </foo>
    </root>
  • file1.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <foo>
    <bar>
    <info id="1"><label>LabelText</label>SampleTextHere 1</info>
    </bar>
    </foo>
    </root>
  • file2.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
    <foo>
    <bar>
    <info id="1"></info>
    </bar>
    </foo>
    </root>

And some sample code.

code.py:

#!/usr/bin/env python3

import sys
import xml.etree.ElementTree as ET


def main():
    file_names = [
        "file0.xml",
        "file1.xml",
        "file2.xml",
    ]

    for file_name in file_names:
        root = ET.parse(file_name).getroot()
        info_node = root.find("foo").find("bar").find("info")
        text = info_node.text
        tail = info_node.tail
        iter_text = "".join(info_node.itertext())
        info_node_text = text or ""
        if not info_node_text:
            for info_node_text in info_node.itertext():
                pass
        print("\n{:s}\n Text (for debugging purposes): [{:}]\n Tail (for debugging purposes): [{:}]\n Iter text (for debugging purposes): [{:s}]\n Value: [{:s}]".format(
            file_name, text, tail, iter_text, info_node_text))


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

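The algorithm is simple: if the node's text attribute is not set, iterate over its itertext() and keep the last value, because the <label> (or any other) child node comes before the text.

Output: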
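Picking the last itertext() value relies on the wanted text coming after every child. As an alternative sketch (not the approach used in code.py), the element's own text can be assembled from its text attribute plus the tail of each direct child. The helper name own_text is hypothetical and is not part of ElementTree; the sample strings mirror the info elements of the three files:

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Return the text directly inside element, skipping text inside its children.

    Hypothetical helper for illustration; not part of ElementTree.
    """
    parts = [element.text or ""]
    # Text that follows a direct child is stored in that child's tail.
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


# The <info> elements of file0.xml, file1.xml and file2.xml as strings.
samples = [
    '<info id="1">SampleTextHere 0</info>',
    '<info id="1"><label>LabelText</label>SampleTextHere 1</info>',
    '<info id="1"></info>',
]
results = [own_text(ET.fromstring(sample)) for sample in samples]
print(results)
```

Unlike "".join(element.itertext()), this leaves out LabelText, and it does not depend on where the child sits relative to the text.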

(py_064_03.06.08_test0) e:\Work\Dev\StackOverflow\q054197111>"e:\Work\Dev\VEnvs\py_064_03.06.08_test0\Scripts\python.exe" code.py
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] on win32


file0.xml
Text (for debugging purposes): [SampleTextHere 0]
Tail (for debugging purposes): [
]
Iter text (for debugging purposes): [SampleTextHere 0]
Value: [SampleTextHere 0]

file1.xml
Text (for debugging purposes): [None]
Tail (for debugging purposes): [
]
Iter text (for debugging purposes): [LabelTextSampleTextHere 1]
Value: [SampleTextHere 1]

file2.xml
Text (for debugging purposes): [None]
Tail (for debugging purposes): [
]
Iter text (for debugging purposes): []
Value: []

Regarding "python - XML parsing ignores text", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54197111/
