
python - Cleaning large XML files in Python (stream parsing)

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:41:40

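I am trying to use Python to clean up some messy XML files. The cleanup does three things:

  1. Convert 40%-50% of the tag names from uppercase to lowercase
  2. Remove NULL between tags
  3. Remove blank lines between tags

I did this with BeautifulSoup, but because some of my XML files are larger than 1 GB, I ran into memory problems. I then looked at streaming approaches such as xml.sax, but I did not fully understand them. Can anyone give me some advice?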

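import re
from bs4 import BeautifulSoup

xml_str = """
<DATA>

<ROW>
<assmtid>1</assmtid>
<Year>1988</Year>
</ROW>

<ROW>
<assmtid>2</assmtid>
<Year>NULL</Year>
</ROW>

<ROW>
<assmtid>2</assmtid>
<Year>1990</Year>
</ROW>

</DATA>
"""

# Drop NULL values, then let the lxml (HTML) parser lowercase the tag names.
# This loads the whole document into memory, which is the problem for 1 GB files.
xml_str_update = re.sub(r">NULL", ">", xml_str)
soup = BeautifulSoup(xml_str_update, "lxml")
print(soup.data.prettify().strip())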

Update

After some testing and taking Jarrod Roberson's advice, here is one possible solution.

import xml.etree.ElementTree as etree
from io import BytesIO

def getelements(xml_str):
    # For a real large file, pass the file path to iterparse instead of a BytesIO.
    source = BytesIO(xml_str.encode('utf-8'))
    context = iter(etree.iterparse(source, events=('start', 'end')))
    event, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event == 'end' and elem.tag == "ROW":
            elem.tag = elem.tag.lower()
            elem.text = "\n\t\t"
            elem.tail = "\n\t"

            for child in elem:
                child.tag = child.tag.lower()
                if child.text == "NULL":
                    # if you do not like self-closing tags,
                    # add &#x200B;, which is a zero-width space
                    child.text = ""
                if child.text is None:
                    child.text = ""
            yield elem
            root.clear()  # release rows that have already been processed

# pth_to_output_xml is the path of the output file and must be defined beforehand
with open(pth_to_output_xml, 'wb') as file:
    # start root
    file.write(b'<data>\n\t')
    for page in getelements(xml_str):
        file.write(etree.tostring(page, encoding='utf-8'))
    # close root
    file.write(b'</data>')

Best answer

Iterative parsing

When building an in-memory tree is not desired or practical, use an iterative parsing technique that does not rely on reading the entire source file. lxml offers two approaches: supplying a target parser class, and using the iterparse method.

import xml.etree.ElementTree as etree

# xmL is the path to the XML file (or an open file object)
for event, elem in etree.iterparse(xmL, events=('start', 'end', 'start-ns', 'end-ns')):
    print(event, elem)
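The other approach named in the quote, a target parser class, can be sketched with the standard library as well, since xml.etree.ElementTree.XMLParser also accepts a target object. This is only a minimal illustration: the class name RowCounter, the helper count_rows, and the sample data are hypothetical and do not come from the original answer.

```python
import xml.etree.ElementTree as etree


class RowCounter:
    """Hypothetical target object: counts <ROW> elements and NULL values
    without ever building a tree."""

    def __init__(self):
        self.rows = 0
        self.nulls = 0
        self._buf = []

    def start(self, tag, attrib):
        # Called for every opening tag.
        self._buf = []
        if tag == "ROW":
            self.rows += 1

    def data(self, data):
        # Text may arrive in several pieces, so buffer it until the closing tag.
        self._buf.append(data)

    def end(self, tag):
        # Called for every closing tag; the buffered text is now complete.
        if "".join(self._buf).strip() == "NULL":
            self.nulls += 1
        self._buf = []

    def close(self):
        # The return value becomes the result of parser.close().
        return self.rows, self.nulls


def count_rows(chunks):
    """Feed the document piece by piece, e.g. blocks read from a large file."""
    parser = etree.XMLParser(target=RowCounter())
    for chunk in chunks:
        parser.feed(chunk)
    return parser.close()


sample = "<DATA><ROW><Year>1988</Year></ROW><ROW><Year>NULL</Year></ROW></DATA>"
rows, nulls = count_rows([sample])
```

For a large file, the chunks would typically come from reading the file in fixed-size blocks, so memory use stays bounded regardless of the file size.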

Here is a very complete tutorial on how to do this.

This will parse the XML file in chunks at a time and give it to you at every step of the way. start will trigger when a tag is first encountered. At this point elem will be empty except for elem.attrib that contains the properties of the tag. end will trigger when the closing tag is encountered, and everything in-between has been read.
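The order of start and end events described above can be observed on a small document. The following is a rough sketch; the sample data, including the id attribute, is made up for illustration.

```python
import io
import xml.etree.ElementTree as etree

# Hypothetical sample document; the id attribute is only there to show elem.attrib.
sample = b"<DATA><ROW id='1'><Year>1988</Year></ROW></DATA>"

events = []
year_text = None
for event, elem in etree.iterparse(io.BytesIO(sample), events=("start", "end")):
    # elem.attrib is available on both events; elem.text should only be
    # relied on at the "end" event, once the element has been read completely.
    events.append((event, elem.tag, dict(elem.attrib)))
    if event == "end" and elem.tag == "Year":
        year_text = elem.text

for item in events:
    print(item)
```

Each element shows up twice, once when it opens and once when it closes, and child elements are closed before their parent.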

Then, in your event handler, you simply write out the transformed information as you encounter it.
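Since the question mentions xml.sax, here is one way such a write-as-you-go handler might look. It is a sketch under assumptions, not the answerer's code: the class name CleaningWriter and the helper clean are hypothetical, and it builds on XMLGenerator from xml.sax.saxutils, which writes SAX events back out as XML. It lowercases tag names, drops text that is exactly NULL, and collapses blank lines between tags to a single newline.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class CleaningWriter(XMLGenerator):
    """Hypothetical handler: writes each event straight to the output,
    lowercasing tag names and dropping NULL values and blank lines."""

    def __init__(self, out):
        super().__init__(out, encoding="utf-8")
        self._text = []

    def _flush_text(self):
        # Emit the text collected since the previous tag, after cleaning it.
        text = "".join(self._text)
        self._text = []
        if text.strip() == "NULL":
            return  # drop NULL values entirely
        if text.strip() == "":
            # Whitespace only: keep at most one newline, which removes blank lines.
            text = "\n" if "\n" in text else ""
        if text:
            super().characters(text)

    def startElement(self, name, attrs):
        self._flush_text()
        super().startElement(name.lower(), attrs)

    def endElement(self, name):
        self._flush_text()
        super().endElement(name.lower())

    def characters(self, content):
        # Text can arrive in several pieces, so buffer it until the next tag.
        self._text.append(content)


def clean(xml_bytes):
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, CleaningWriter(out))
    return out.getvalue()


sample = b"<DATA>\n\n<ROW>\n<Year>NULL</Year>\n</ROW>\n\n</DATA>"
cleaned = clean(sample)
print(cleaned)
```

For a file on disk, xml.sax.parse(path, handler) with an output file opened for writing would stream the input in the same way, so only the text between two consecutive tags is held in memory at any time.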

Regarding python - Cleaning large XML files in Python (stream parsing), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33704603/
