python - xml.etree.ElementTree vs. lxml.etree: different internal node representation?


Reposted by: 太空狗. Updated: 2023-10-29 17:47:31

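I have been converting some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily, the two have a lot in common. However, I did stumble upon some strange behaviour that I cannot find described in any documentation. It concerns the internal representation of descendant nodes.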

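In ET, iter() is used to iterate over an element and all of its descendants, optionally filtered by tag name. Because I could not find any details about this in the documentation, I expected lxmlET to behave the same way. From my tests, however, I conclude that lxmlET uses a different internal representation of the tree.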

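In the example below, I iterate over the nodes in a tree and print each node's children. In addition, I build every combination of those children and print those too. That is, if an element has children ('A', 'B', 'C'), I build the alternative trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].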

# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if node.tag != 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))    

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

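The expected output is the ids of each node's children joined with hyphens, followed by every combination of those children (see above), with the nodes visited top-down in document order.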

2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

However, when you use the lxml module instead of xml (uncomment the lxmlET import and comment out the ET import) and run the code, the output is

2-3
2
3
node 2 has no children

So the deeper descendant nodes are never visited. This can be circumvented by either:

using deepcopy (comment/uncomment the relevant lines in get_combination_trees()), or
using for subnode in node.xpath('.//node') in parse_xml() instead of iter().
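
The first workaround can be sketched as a standalone variant of get_combination_trees(). This is only an illustration under a few assumptions: the sample XML string and the names combos and visited are made up for the demo, and the standard library module is imported so the snippet runs without extra packages. Swapping the import for lxml.etree should exercise the same code path.

```python
import xml.etree.ElementTree as ET  # swap in lxml.etree to try it there
from copy import deepcopy
from itertools import combinations


def get_combination_trees(tree):
    """Yield one new element per combination of 1..n-1 of tree's children."""
    children = list(tree)
    for size in range(1, len(children)):
        for combination in combinations(children, size):
            combo_tree = ET.Element(tree.tag, dict(tree.attrib))
            for child in combination:
                # Append a copy so the source tree is never modified.
                combo_tree.append(deepcopy(child))
            yield combo_tree


# Hypothetical sample document, smaller than the one in the question.
root = ET.fromstring(
    '<node id="1"><node id="2"/><node id="3"><node id="4"/></node></node>'
)
combos = ['-'.join(c.attrib['id'] for c in combo)
          for combo in get_combination_trees(root)]
visited = [n.attrib['id'] for n in root.iter('node')]
print(combos)
print(visited)
```

Because only copies are appended, the original tree keeps all of its nodes, so a later iter() over it still reaches the deeper descendants.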

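So I know there are ways around this, but I mainly want to know what is going on. It took me ages to debug this, and I cannot find any documentation on it. What is happening, and what is the actual underlying difference between the two modules? And what is the most efficient workaround when working with very large trees?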

Best answer

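While Louis's answer is correct, and I completely agree that modifying a data structure while you traverse it is generally a Bad Idea(tm), you also asked why the code works with xml.etree.ElementTree and not with lxml.etree, and there is a very reasonable explanation for that.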

Implementation of `.append` in `xml.etree.ElementTree`

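This library is implemented directly in Python and could vary depending on which Python runtime you are using. Assuming you are using CPython, the implementation you are looking for is written in vanilla Python (CPython normally swaps in a C accelerator at import time, but the pure-Python source is the readable reference):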

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we care about. It turns out that self._children is initialized towards the top of that file as:

self._children = []

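So adding a child to a tree is just appending an element to a list. Intuitively, that is exactly what you are looking for (in this case), and the implementation behaves in a completely unsurprising way.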
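
A small experiment makes the list-append behaviour visible. It is only a sketch: the XML string and the variable names are invented for the demonstration, and it uses nothing beyond the standard library.

```python
import xml.etree.ElementTree as ET

parent = ET.fromstring('<a><b/><c/></a>')
other = ET.Element('other')

child = parent[0]
other.append(child)  # stores one more reference; nothing is unlinked

still_in_parent = parent[0] is child
also_in_other = other[0] is child

# Both parents hold the very same object, so a change shows up in both.
child.set('seen', 'yes')
shared_attr = parent[0].get('seen')

print(len(parent), still_in_parent, also_in_other, shared_attr)
```

The appended element ends up referenced by two parents at once, which is why iterating over the original tree is unaffected.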

Implementation of `.append` in `lxml.etree`

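lxml is implemented as a mix of Python, non-trivial Cython, and C code, so spelunking through it is considerably harder than with the pure-Python implementation. First off, .append is implemented as: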

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented in apihelper.pxi:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0

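There is definitely more going on here. In particular, lxml explicitly removes the node from its current tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you can probably do with the xml.etree version).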
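
The move semantics and the cycle check can be observed with a short script. This is a sketch that assumes the third-party lxml package is installed; the import is guarded so the script simply does nothing without it. The XML string and variable names are invented for the demonstration.

```python
try:
    import lxml.etree as lxmlET  # third-party package; may not be installed
except ImportError:
    lxmlET = None

remaining = None       # tags left under the original parent
new_parent = None      # tag of the moved node's parent after append
cycle_rejected = None  # whether appending an ancestor raised ValueError

if lxmlET is not None:
    parent = lxmlET.fromstring('<a><b/><c/></a>')
    other = lxmlET.Element('other')

    moved = parent[0]
    other.append(moved)  # unlinks <b/> from <a> and relinks it under <other>

    remaining = [child.tag for child in parent]
    new_parent = moved.getparent().tag

    try:
        moved.append(other)  # <other> is now an ancestor of <b/>
        cycle_rejected = False
    except ValueError:
        cycle_rejected = True

print(remaining, new_parent, cycle_rejected)
```

With lxml available, the original parent should be left with only its second child, which is the effect that cuts the iter() traversal short in the question.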

Workarounds for `lxml`

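Now that we know that xml.etree effectively copies nodes on append (it stores one more reference and leaves the original parent untouched) while lxml.etree moves them, why do those workarounds work? Based on the tree.xmlUnlinkNode function (which is actually defined in C inside libxml2), unlinking just rearranges a handful of pointers. So anything that copies the node metadata will do the trick. Because all of the metadata we care about are direct fields on the xmlNode struct, anything that shallow-copies nodes will do:

copy.deepcopy() definitely works
node.xpath returns nodes wrapped in proxy elements, which happen to shallow-copy the tree metadata; it also returns a complete list up front rather than a lazy iterator, so moving nodes afterwards cannot cut the loop short
copy.copy() does the trick as well
If you do not need your combinations to actually live in a real tree, setting new_combo_tree = [] also gives you list appending, just like xml.etree.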
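
The last option can be sketched as follows. The helper name get_combination_lists and the sample XML are hypothetical, and the standard library module is used so the snippet runs as-is; since plain lists never reparent anything, the same function should behave identically with lxml.etree elements.

```python
import xml.etree.ElementTree as ET
from itertools import combinations


def get_combination_lists(element):
    """Yield plain lists of children instead of new parent elements."""
    children = list(element)
    for size in range(1, len(children)):
        for combination in combinations(children, size):
            # A list only holds references; no node changes parent.
            yield list(combination)


# Hypothetical sample node with three children.
node = ET.fromstring(
    '<node id="1"><node id="2"/><node id="3"/><node id="4"/></node>'
)
combos = ['-'.join(child.attrib['id'] for child in combo)
          for combo in get_combination_lists(node)]
print(combos)
```

This avoids copying entirely, at the cost of the combinations not being elements with a tag and attributes of their own.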

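If you are really concerned about performance and large trees, I would probably start with shallow copying using copy.copy(), although you should absolutely profile a few different options and see which one works best for you.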

For more on python - xml.etree.ElementTree vs. lxml.etree: different internal node representation?, see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/50749937/
