
python - lxml element.clear() and accessing child elements

Reposted. Author: 数据小太阳. Updated: 2023-10-29 02:30:59

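I am using lxml.iterparse to parse a fairly large XML file. At some point an out-of-memory exception is thrown. I am aware of the similar questions: a tree is built up while parsing, and you should normally release the parts you no longer need with element.clear().

My code looks like this (shortened):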

for event, element in context:
    if element.tag == xmlns + 'initialized':
        attributes = element.findall(xmlns+'attribute')
        heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
        characteristics['max_heap_size_MB'] = bytes_to_MB(int(heapsize, 16))

    #clear up the built tree to avoid mem alloc fails
    element.clear()
del context

If I comment out element.clear(), this works. If I call element.clear(), I get KeyErrors like this:

Traceback (most recent call last):
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 289, in <module>
    main()
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 277, in main
    join_characteristics_and_score(logpath, benchmarkscores)
  File "C:\Users\NN\Documents\scripts\analyse\analyse_all.py", line 140, in join_characteristics_and_score
    parsed_verbose_xml = parse_xml(verbose)
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in parse_xml
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "C:\Users\NN\Documents\scripts\analyse\analyze_g.py", line 62, in <lambda>
    heapsize = filter(lambda x:x.attrib['name']=='maxHeapSize', attributes)[0].attrib['value']
  File "lxml.etree.pyx", line 2272, in lxml.etree._Attrib.__getitem__ (src\lxml\lxml.etree.c:54751)
KeyError: 'name'

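When I print the elements' attributes without calling element.clear(), they are regular dictionaries with values. With element.clear(), those dictionaries are empty.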
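The symptom can be reproduced in a few lines. What follows is a minimal sketch, not the asker's code: the XML snippet and the function name attribs_seen_at_parent are made up for illustration. It falls back to the standard library's xml.etree.ElementTree when lxml is not installed, on the assumption that iterparse and clear() behave the same way in both for this case.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Fallback (assumption: same iterparse/clear() behaviour for this case).
    import xml.etree.ElementTree as etree

# Hypothetical sample document, loosely modelled on the question.
XML = (
    b"<root><initialized>"
    b'<attribute name="maxHeapSize" value="0x1000"/>'
    b"</initialized></root>"
)


def attribs_seen_at_parent(clear_every_element):
    """Return the children's attributes as seen when <initialized> ends."""
    seen = []
    # By default iterparse only reports "end" events, children before parents.
    for event, element in etree.iterparse(io.BytesIO(XML)):
        if element.tag == "initialized":
            seen = [dict(child.attrib) for child in element.findall("attribute")]
        if clear_every_element:
            # Also wipes the attributes of <attribute>, whose "end" event
            # arrives before the "end" event of <initialized>.
            element.clear()
    return seen


print(attribs_seen_at_parent(False))
print(attribs_seen_at_parent(True))
```

Without clearing, the child still carries its name and value attributes when the parent is reached. With clearing on every event, the child is still attached to the parent but its attribute dictionary is empty, which is what makes attrib['name'] raise KeyError.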

Edit

A minimal runnable Python program that illustrates the problem:

#!/usr/bin/python

from lxml import etree
from pprint import pprint

def fast_iter(context, func, *args, **kwargs):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    for event, elem in context:
        func(elem, *args, **kwargs)
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

def process_element(elem):
    xmlns = "{http://www.ibm.com/j9/verbosegc}"

    if elem.tag == xmlns + "gc-start":
        memelements = elem.findall('.//root:mem', namespaces = {'root':xmlns[1:-1]})
        pprint(memelements)

if __name__ == '__main__':
    with open('small.xml', "r+") as xmlf:
        context = etree.iterparse(xmlf)
        fast_iter(context, process_element)

The contents of the XML file are as follows:

<verbosegc xmlns="http://www.ibm.com/j9/verbosegc">
  <gc-start id="5" type="scavenge" contextid="4" timestamp="2013-06-14T15:48:46.815">
    <mem-info id="6" free="3048240" total="4194304" percent="72">
      <mem type="nursery" free="0" total="1048576" percent="0">
        <mem type="allocate" free="0" total="524288" percent="0" />
        <mem type="survivor" free="0" total="524288" percent="0" />
      </mem>
      <mem type="tenure" free="3048240" total="3145728" percent="96">
        <mem type="soa" free="2891568" total="2989056" percent="96" />
        <mem type="loa" free="156672" total="156672" percent="100" />
      </mem>
      <remembered-set count="1593" />
    </mem-info>
  </gc-start>
</verbosegc>

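Best answer

The KeyError comes from clearing too early: iterparse reports the 'end' event of each child before that of its parent, so calling element.clear() on every element empties the children's attributes before the parent is processed. Liza Daly wrote an excellent article on processing large XML using lxml. Try the fast_iter code from that article, restricted here to 'end' events of gc-start elements so that clear() is only called once the whole subtree has been read: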

import lxml.etree as ET
import pprint


def fast_iter(context, func, *args, **kwargs):
    """
    http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ (Liza Daly)
    See also http://effbot.org/zone/element-iterparse.htm
    """
    for event, elem in context:
        func(elem, *args, **kwargs)
        # It's safe to call clear() here because no descendants will be
        # accessed
        elem.clear()
        # Also eliminate now-empty references from the root node to elem
        # (ancestor loop added by unutbu)
        for ancestor in elem.xpath('ancestor-or-self::*'):
            while ancestor.getprevious() is not None:
                del ancestor.getparent()[0]
    del context


def process_element(elem, namespaces):
    memelements = elem.findall('.//root:mem', namespaces=namespaces)
    pprint.pprint(memelements)

if __name__ == '__main__':
    xmlns = "http://www.ibm.com/j9/verbosegc"
    namespaces = {'root': xmlns}
    # Binary mode ("rb" instead of the original "r+"): on Python 3, lxml's
    # iterparse requires file objects whose read() returns bytes
    with open('small.xml', "rb") as xmlf:
        # Only 'end' events of <gc-start> elements are yielded
        context = ET.iterparse(xmlf, events=('end', ),
                               tag='{{{}}}gc-start'.format(xmlns))
        fast_iter(context, process_element, namespaces)
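The same idea can be applied to the heap-size snippet from the question. The following is only a sketch under stated assumptions: the function name read_max_heap_size and the sample document are hypothetical, the structure of the initialized element is inferred from the question's code, and the import falls back to xml.etree.ElementTree when lxml is not installed. It also replaces filter(...)[0] with next(), because on Python 3 filter returns an iterator that cannot be indexed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Fallback (assumption: same behaviour for the calls used here).
    import xml.etree.ElementTree as etree

XMLNS = "{http://www.ibm.com/j9/verbosegc}"

# Hypothetical sample; the real <initialized> element may look different.
SAMPLE = (
    b'<verbosegc xmlns="http://www.ibm.com/j9/verbosegc"><initialized>'
    b'<attribute name="maxHeapSize" value="0x20000000"/>'
    b'<attribute name="other" value="0x1"/>'
    b"</initialized></verbosegc>"
)


def read_max_heap_size(source, xmlns=XMLNS):
    """Return maxHeapSize (parsed as hex) of the first <initialized>, or None."""
    for event, element in etree.iterparse(source, events=("end",)):
        if element.tag != xmlns + "initialized":
            # Do not clear here: this element may be a child of an
            # <initialized> element that has not ended yet.
            continue
        value = next(
            (a.get("value")
             for a in element.findall(xmlns + "attribute")
             if a.get("name") == "maxHeapSize"),
            None,
        )
        # Safe now: the complete subtree has already been read.
        element.clear()
        return int(value, 16) if value is not None else None
    return None


print(read_max_heap_size(io.BytesIO(SAMPLE)))
```

This variant stops at the first initialized element, so it never builds the rest of the tree. If the whole file has to be scanned, elements that are skipped here would still accumulate, and the ancestor clean-up in fast_iter above is the part that deals with that.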

Regarding python - lxml element.clear() and accessing child elements, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/16724033/
