
Python LXML iterparse with nested elements

Reposted. Author: 太空宇宙. Updated: 2023-11-04 01:40:20

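I would like to retrieve the content of a specific element in an XML file. However, that element contains other XML elements, and they break the extraction of the parent tag's content. An example: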

from io import BytesIO
from lxml import etree

xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

# Wrap the document as bytes so iterparse can read it from a binary file-like object
context = etree.iterparse(BytesIO(xml.encode('utf-8')), events=('end',), tag='claim-text')
for event, element in context:
    print(element.text)  # only the text before the element's first child

The result is:

a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
None

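However, "A protective uniform for use ..", for instance, is missing. It seems that every `claim-text` element that contains other inner elements is skipped. How should I change the XML parsing so that I get all the claims?

Thanks

I have just solved this with a "plain" SAX-style parser approach: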

class SimpleXMLHandler(object):

    def __init__(self):
        self.buffer = ''   # text collected for the current claim
        self.claim = 0     # 1 while inside a <claim-text>, otherwise 0

    def start(self, tag, attributes):
        if tag == 'claim-text':
            if self.claim == 0:
                self.buffer = ''
            self.claim = 1

    def data(self, data):
        if self.claim == 1:
            self.buffer += data

    def end(self, tag):
        if tag == 'claim-text':
            # Note: this fires for nested <claim-text> elements too
            print(self.buffer)
            self.claim = 0

    def close(self):
        pass
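Because the handler above resets its flag whenever any `claim-text` closes, nested claims can produce partial or repeated output. The following is a minimal sketch of one possible variant, not the asker's code: it counts nesting depth and emits each outermost claim once. The class name `ClaimTextCollector`, its attributes, and the shortened sample document are assumptions made for illustration; the handler is plugged in through lxml's parser target interface (`etree.XMLParser(target=...)`).

```python
from lxml import etree


class ClaimTextCollector:
    """Hypothetical parser target: gathers the text of each outermost <claim-text>."""

    def __init__(self):
        self.depth = 0    # number of <claim-text> elements currently open
        self.parts = []   # text fragments of the claim being collected
        self.claims = []  # finished claims, one string per outermost element

    def start(self, tag, attrib):
        if tag == 'claim-text':
            if self.depth == 0:
                self.parts = []  # a new outermost claim begins
            self.depth += 1

    def data(self, data):
        if self.depth > 0:
            self.parts.append(data)

    def end(self, tag):
        if tag == 'claim-text':
            self.depth -= 1
            if self.depth == 0:
                self.claims.append(''.join(self.parts))

    def close(self):
        # The parser hands this return value back to the caller
        return self.claims


# Shortened version of the sample document from the question (assumption)
xml = (
    b"<test><claim-text><b>2</b>. A uniform comprising: "
    b"<claim-text>a. an upper garment</claim-text> "
    b"<claim-text>b. panels</claim-text></claim-text></test>"
)

parser = etree.XMLParser(target=ClaimTextCollector())
claims = etree.fromstring(xml, parser)
print(claims)
```

With this sketch the nested sub-claims end up inside the text of their parent claim; if each sub-claim should also be reported separately, a stack of buffers would be needed instead of a single counter.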

Best answer

You can use XPath to find and join all the text nodes directly under each <claim-text> node, like this:

from io import BytesIO
from lxml import etree
xml = '''<?xml version='1.0' ?><test><claim-text><b>2</b>. A protective uniform for use by a person in combat or law enforcement, said uniform comprising: <claim-text>a. an upper body garment and a separate lower body garment</claim-text> <claim-text>b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;</claim-text></claim-text></test>'''

context = etree.iterparse(BytesIO(xml.encode('utf-8')), events=('start',), tag='claim-text')
for event, element in context:
    # text() selects only the direct text-node children of this element
    print(''.join(element.xpath('text()')))

Output:

. A protective uniform for use by a person in combat or law enforcement, said uniform comprising:  
a. an upper body garment and a separate lower body garment
b. a plurality of a ballistic and non-ballistic panels for attaching to the upper body garment and the lower body garment, and;
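Some background on why `element.text` came back as `None`: in lxml, `.text` holds only the text before an element's first child, while text that follows a child is stored in that child's `.tail`. The lxml documentation also notes that an element's text, tail, and children are only guaranteed to be fully parsed at the `end` event, so relying on `start` events may be fragile for larger files. The following is a hedged sketch, assuming a shortened sample document and a hypothetical `results` list, that uses `end` events and compares `.text`, `xpath('text()')`, and `itertext()`.

```python
from io import BytesIO
from lxml import etree

# Shortened version of the sample document from the question (assumption)
xml = (
    b"<test><claim-text><b>2</b>. A uniform comprising: "
    b"<claim-text>a. an upper garment</claim-text> "
    b"<claim-text>b. panels</claim-text></claim-text></test>"
)

results = []  # hypothetical container: (text, own_text, full_text) per element
context = etree.iterparse(BytesIO(xml), events=('end',), tag='claim-text')
for event, element in context:
    # Direct text nodes of this element only (text after <b> and after each child)
    own_text = ''.join(element.xpath('text()'))
    # Text of this element and of all its descendants, in document order
    full_text = ''.join(element.itertext())
    results.append((element.text, own_text, full_text))

for text, own_text, full_text in results:
    print(repr(text), '|', repr(own_text), '|', repr(full_text))
```

With `end` events the inner claims are reported before the outer one. `xpath('text()')` leaves out the "2" inside `<b>` and the nested claims, whereas `itertext()` includes them, so which one fits depends on whether sub-claims should be part of the parent's text.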

This post on Python LXML iterparse with nested elements is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/5732291/
