
python - lxml object identifiers appear to be reused while the objects are still live

Reposted. Author: 行者123. Updated: 2023-12-01 07:34:25

I'm using Python 3.6.8 with lxml-4.3.4 on Ubuntu.

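What I'm after is breaking large XML content into fragment files that are easier to work with, while retaining the source filename and line number of each parsed element so that I can produce useful parse-time error messages. The errors I raise are specific to my application and occur even when the XML itself is well-formed.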

Here are some sample XML fragment files:

one.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
<one>1</one>
<one>11</one>
<one>111</one>
<one>1111</one>
</data>

two.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
<two>2</two>
<two>22</two>
<two>222</two>
<two>2222</two>
<two>22222</two>
<two>222222</two>
</data>

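My plan is to use lxml to parse each file and then simply stitch the element trees together under a single root. The rest of my program can then consume the whole tree.

If an element's content is invalid for my application, I want to report the fragment file and line number it came from. lxml already has the line number, but not the source file, so I want to track that. Note that I decided not to try extending lxml's classes, and instead to use a map from element object identifier to fragment file, which I hoped would keep working even if lxml refactors its source code.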

from lxml import etree

# Too much data for one source file, so let's define
# fragment files, each of which looks like a stand
# alone XML file w/ header and root <data>...</data>
# to make syntax highlighters happy.
xmlFragmentFiles = ['one.xml', 'two.xml']

# lxml tracks line number for parsed elements, but not
# source filename. Rather than try to extend the deep
# inner classes of the module, let's try keeping a map
# from parsed elements to fragment file they just came
# from.
element2fragment = {}
def AddFragmentFileToETree(element, fragmentFile):
    # The entry we're just about to add.
    print('%s:%s' % (id(element), fragmentFile))
    element2fragment[id(element)] = fragmentFile
    for child in element:
        AddFragmentFileToETree(child, fragmentFile)

# Fabricate a root that we'll stitch each fragment's
# children onto as we parse them.
root = etree.fromstring('<data></data>')
AddFragmentFileToETree(root, 'Programmatic Root')

for filename in xmlFragmentFiles:
    # It doesn't seem to matter whether we create a new
    # parser per fragment, or reuse a single parser.
    parser = etree.XMLParser(remove_comments=True)
    subroot = etree.parse(filename, parser).getroot()
    for child in subroot:
        root.append(child)
        AddFragmentFileToETree(child, filename)

# Clearly the final desired tree is here, and presumably
# all the subelements we care about are reachable from
# the programmatic root meaning the objects are still
# live, so why did any object identifier get reused?
print(etree.tostring(
    root, encoding=str, pretty_print=True))

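When I run this program, the pretty-printed output shows the whole desired tree, with every distinct element from the fragment files. However, looking at the map entries we insert, we can clearly see that object identifiers are being reused!?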

140611035114248:Programmatic Root
140611035114056:one.xml <-- see here
140611035114376:one.xml
140611035114440:one.xml
140611035114056:one.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and again
<data><one>1</one>
<one>11</one>
<one>111</one>
<one>1111</one>
<two>2</two>
<two>22</two> <-- yet all distinct elements still exist
<two>222</two>
<two>2222</two>
<two>22222</two>
<two>222222</two>
</data>
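
To quantify the reuse, the eleven `id:fragment` lines printed above can be tallied. The following is only a minimal sketch: the `LOG` string is copied from the output in this post with the `<--` annotations removed, and the names `LOG`, `ids`, `counts` and `reused` are introduced here for illustration.

```python
from collections import Counter

# The eleven "id:fragment" lines printed by the run above,
# without the "<--" annotations.
LOG = """\
140611035114248:Programmatic Root
140611035114056:one.xml
140611035114376:one.xml
140611035114440:one.xml
140611035114056:one.xml
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml
"""

# Keep only the identifier part of each line.
ids = [line.split(':', 1)[0] for line in LOG.splitlines()]
counts = Counter(ids)

# Identifiers that were handed out more than once.
reused = {ident: n for ident, n in counts.items() if n > 1}

print('%d entries, %d distinct ids' % (len(ids), len(counts)))
for ident, n in reused.items():
    print('%s seen %d times' % (ident, n))
```

Because `element2fragment` is keyed by `id(element)`, each repeated identifier overwrites an earlier entry, so the map ends up with fewer entries than there are elements in the tree.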

Any advice about these objects? Should I perhaps move away from lxml, since it is a C library? I switched to lxml only for the line number tracking.

Best answer

I decided to go ahead with extending/customizing the parser... and found the answer to this original question.

https://lxml.de/element_classes.html

They warn that the Python Element proxies are stateless:

Element instances are created and garbage collected at need, so there is normally no way to predict when and how often a proxy is created for them.

They go on to say that if you really need them to carry state, you must keep a live reference to each one:

proxy_cache = list(root.iter())

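This worked for me. I had assumed that holding the root would be enough, since elements hold live references to their children, but the proxies are evidently created on demand from the real tree maintained in C.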
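
As a rough illustration of the proxy-cache idea applied to the program in the question, here is a minimal sketch. It makes several assumptions that are not in the original: the fragments are small in-memory strings (the hypothetical `FRAGMENTS` dict) rather than files on disk, the cache is filled incrementally as each fragment is stitched in instead of with a single `list(root.iter())` at the end, and `describe` is a made-up helper.

```python
from lxml import etree

# Hypothetical in-memory stand-ins for one.xml and two.xml.
FRAGMENTS = {
    'one.xml': b'<data><one>1</one><one>11</one></data>',
    'two.xml': b'<data><two>2</two><two>22</two><two>222</two></data>',
}

root = etree.fromstring('<data></data>')

# Holding a reference to every proxy keeps it alive, so its id()
# cannot be handed to a different element later.
proxy_cache = [root]
element2fragment = {id(root): 'Programmatic Root'}

for filename, xml in FRAGMENTS.items():
    subroot = etree.fromstring(xml)
    # Snapshot the children before moving them to the new root.
    for child in list(subroot):
        root.append(child)
        # iter() yields the child itself, then its descendants.
        for element in child.iter():
            proxy_cache.append(element)
            element2fragment[id(element)] = filename

def describe(element):
    # Hypothetical helper: "fragment:line" for an error message.
    return '%s:%s' % (element2fragment[id(element)], element.sourceline)

locations = [describe(child) for child in root]
print(locations)
```

With the cache in place, walking `root` again returns the same proxy objects, so the lookups by `id(element)` in `describe` find the entry that was stored for that element.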

Regarding "python - lxml object identifiers appear to be reused while the objects are still live", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57059021/
