
python - Removing elements from a 300 MB XML file in Python/ElementTree

Reposted. Author: 行者123. Updated: 2023-12-01 04:27:46

I'm trying to parse a 300 MB XML file with ElementTree, following suggestions such as "Can Python xml ElementTree parse a very large xml file?".

from xml.etree import ElementTree as Et

for event, elem in Et.iterparse('C:\...path...\desc2015.xml'):
    if elem.tag == 'DescriptorRecord':
        for e in elem._children:
            if str(e.tag) in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                e.clear()
                elem.remove(e)
                print 'removed %s' % e

This gives:

removed <Element 'HistoryNote' at 0x557cc7f0>
removed <Element 'DateCreated' at 0x557fa990>
removed <Element 'HistoryNote' at 0x55809af0>
removed <Element 'DateCreated' at 0x5580f5d0>

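However, this just keeps going, the file doesn't get any smaller, and on inspection the elements are still there. I tried e.clear() and elem.remove(e) separately, with the same result. Regards

Update

The error from the code in my first comment on @alexanderlukanin13's answer: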

Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1570, in trace_dispatch
Traceback (most recent call last):
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 2278, in <module>
    globals = debugger.run(setup['file'], None, None)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd.py", line 1704, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 234, in <module>
    main()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\runfiles.py", line 78, in main
    return pydev_runfiles.main(configuration)  # Note: still doesn't return a proper value.
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 835, in main
    PydevTestRunner(configuration).run_tests()
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 762, in run_tests
    file_and_modules_and_module_name = self.find_modules_from_files(files)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 517, in find_modules_from_files
    mod = self.__get_module_from_str(import_str, print_exception, pyfile)
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydev_runfiles.py", line 476, in __get_module_from_str
    buf_err = pydevd_io.StartRedirect(keep_original_redirection=True, std='stderr')
  File "C:\Users\Eddie\Downloads\eclipse\plugins\org.python.pydev_4.0.0.201504132356\pysrc\pydevd_io.py", line 72, in StartRedirect
    import sys
MemoryError

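Best answer

The main problem in your script is that you never save the modified XML back to disk. You need to keep a reference to the root element and then call ElementTree.write: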

from xml.etree import ElementTree as Et

context = Et.iterparse('input.xml')
root = None
for event, elem in context:
    if elem.tag == 'DescriptorRecord':
        # Iterate over a copy of the children; don't use _children, it's a private field.
        # list(elem) replaces getchildren(), which was removed in Python 3.9.
        for e in list(elem):
            if e.tag in ['DateCreated', 'Year', 'Month', 'TreeNumber', 'HistoryNote', 'PreviousIndexing']:
                elem.remove(e)  # You need remove(), not clear()
    root = elem  # after the loop this holds the last element yielded by iterparse

with open('output.xml', 'wb') as file:
    Et.ElementTree(root).write(file, encoding='utf-8', xml_declaration=True)

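Note: here I get the root element in an awkward (and possibly unsafe) way: I assume it is always the last element in the iterparse output. If anyone knows a better way, please let me know.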
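One commonly used alternative, offered here only as a sketch, is to also request 'start' events from iterparse and take the element of the very first event as the root, since the document element is the first one opened. The helper name strip_descriptor_children, the UNWANTED set, the root tag, and the tiny sample document are made up for illustration, and the code assumes Python 3. Note that it still keeps the whole tree in memory, just like the code above.

```python
import io
from xml.etree import ElementTree as ET

# Tags to strip from each DescriptorRecord (same list as in the answer).
UNWANTED = {'DateCreated', 'Year', 'Month', 'TreeNumber',
            'HistoryNote', 'PreviousIndexing'}


def strip_descriptor_children(source):
    """Parse source, drop unwanted children of every DescriptorRecord,
    and return the root element (hypothetical helper)."""
    context = ET.iterparse(source, events=('start', 'end'))
    # The first event is the 'start' of the document element, i.e. the root.
    _, root = next(context)
    for event, elem in context:
        # Only act on 'end' events, when the record is fully parsed.
        if event == 'end' and elem.tag == 'DescriptorRecord':
            for child in list(elem):  # copy, because elem is mutated in the loop
                if child.tag in UNWANTED:
                    elem.remove(child)
    return root


# Tiny made-up document standing in for the real 300 MB file.
sample = io.BytesIO(
    b'<DescriptorRecordSet>'
    b'<DescriptorRecord><DescriptorUI>D000001</DescriptorUI>'
    b'<DateCreated><Year>1999</Year></DateCreated>'
    b'<HistoryNote>note</HistoryNote></DescriptorRecord>'
    b'</DescriptorRecordSet>'
)
root = strip_descriptor_children(sample)
result = ET.tostring(root, encoding='unicode')
print(result)
```

With a real file, a path such as 'input.xml' can be passed in place of the BytesIO object, and the returned root can be written out with ElementTree(root).write(...) exactly as in the answer.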

Regarding "python - Removing elements from a 300 MB XML file in Python/ElementTree", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/32863031/
