python - Unwanted whitespace left behind after removing an element from pretty-printed XML

Reposted. Author: 行者123. Updated: 2023-12-05 08:00:55
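I'm new to LXML and have run into a problem after parsing my element: if I remove (or replace) the last child element, the layout of the pretty-printed output seems to change. Here is my code.

(Sorry, I'm new to stackoverflow, so I can't post images.)

I've looked for a solution, but I still can't work out what I'm doing wrong. I'd really appreciate any help! (I'm using LXML 3.2.1 with Python 2.6 on Windows.)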

from lxml import etree
from copy import deepcopy

def Write( file, element ):
    f = open( file, 'w' )
    f.write( etree.tostring( element, xml_declaration=True, encoding="ISO-8859-1", pretty_print = True ) )
    f.close()
    return 1

def ReadAndReturn( file ):
    lookup = etree.ElementDefaultClassLookup()
    parser = etree.XMLParser(recover = True)
    parser.set_element_class_lookup( lookup )
    mainTree = etree.parse( file, parser )
    return mainTree

# create a root element with 3 children
root = etree.Element( "root" )
root.append( etree.Element( "child1" ) )
child2 = etree.SubElement( root, "child2" )
child2.text = 'CHILD2'
child3 = etree.SubElement( root, "child3" )
child3.text = 'CHILD3'

print "\n--- INITIAL ROOT ---"
print( etree.tostring( root, pretty_print=True ) )

# remove last child
root2 = deepcopy( root )
root2.remove( root2[2] )

print "--- ROOT WITHOUT LAST CHILD / BEFORE WRITING ---"
print( etree.tostring( root2, pretty_print=True ) )


# write initial root (3 children) and read the file
filename = 'test.tst'
status = Write( filename, root )
tree = ReadAndReturn( filename )

# remove last child from the read element
root3 = deepcopy( tree.getroot() )
root3.remove( root3[2] )

print "--- ROOT WITHOUT LAST CHILD / AFTER WRITING AND PARSING ---"
print( etree.tostring( root3, pretty_print=True ) )

Best answer

Whitespace handling can be tricky. Here is a simplified version of your program that demonstrates what is happening.

from lxml import etree

# Create a root element with 3 children
root = etree.Element( "root" )
root.append( etree.Element( "child1" ) )
child2 = etree.SubElement( root, "child2" )
child2.text = 'CHILD2'
child3 = etree.SubElement( root, "child3" )
child3.text = 'CHILD3'

# Print the "ugly" XML (no whitespace)
print "\n--- UGLY ---"
print etree.tostring(root)

# Print the "pretty" XML
print "\n--- PRETTY ---"
pp = etree.tostring(root, pretty_print=True)
print pp

# Parse the pretty XML
tree = etree.fromstring(pp)

# remove last child
tree.remove(tree[2])

print "--- WITHOUT LAST CHILD PART 1 ---"
print etree.tostring(tree, pretty_print=True)

# Parse the pretty XML once again with parser option 'remove_blank_text=True'
tree = etree.fromstring(pp, etree.XMLParser(remove_blank_text=True))

# remove last child
tree.remove(tree[2])

print "--- WITHOUT LAST CHILD PART 2 ---"
print etree.tostring(tree, pretty_print=True)

Output:

--- UGLY ---
<root><child1/><child2>CHILD2</child2><child3>CHILD3</child3></root>

--- PRETTY ---
<root>
  <child1/>
  <child2>CHILD2</child2>
  <child3>CHILD3</child3>
</root>

--- WITHOUT LAST CHILD PART 1 ---
<root>
  <child1/>
  <child2>CHILD2</child2>
  </root>

--- WITHOUT LAST CHILD PART 2 ---
<root>
  <child1/>
  <child2>CHILD2</child2>
</root>

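In the pretty-printed XML document, the child2 element has a .tail property consisting of \n followed by two spaces. You can see this by inspecting repr(pp). Once child3 is removed, those two spaces sit directly before </root>, which is why the closing tag is misaligned in PART 1.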
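
As a minimal sketch of how to check this, the snippet below parses a pretty-printed document and prints every child's tail. It is written for Python 3 (the code above uses Python 2 print statements) and assumes lxml is installed; the names pretty, tails and last_tail are illustrative and not part of the original answer.

```python
from lxml import etree

# The pretty-printed document, written out literally so the whitespace is visible.
pretty = (
    b"<root>\n"
    b"  <child1/>\n"
    b"  <child2>CHILD2</child2>\n"
    b"  <child3>CHILD3</child3>\n"
    b"</root>\n"
)

root = etree.fromstring(pretty)

# Each child's tail holds the whitespace that follows its end tag.
tails = [(child.tag, child.tail) for child in root]
for tag, tail in tails:
    print(tag, repr(tail))

# Removing the last child takes that child's own tail ("\n") with it,
# so child2's tail (newline plus two spaces) now precedes </root>.
root.remove(root[2])
last_tail = root[-1].tail
print(repr(last_tail))
```

Because the tree now contains whitespace text next to elements, the serializer treats it as existing formatting and does not re-indent when pretty_print=True is passed.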

If the pretty-printed XML document is parsed with the parser option remove_blank_text=True, there are no interfering whitespace-only element tails, and the final pretty-print ("PART 2") works as expected.
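
The following Python 3 sketch shows the parser option in isolation and, as an alternative for a tree that has already been parsed, a small helper that clears whitespace-only text and tails by hand. The helper name strip_blank_text is hypothetical (it is not an lxml function), and the snippet assumes lxml is installed.

```python
from lxml import etree

pretty = (
    b"<root>\n"
    b"  <child1/>\n"
    b"  <child2>CHILD2</child2>\n"
    b"  <child3>CHILD3</child3>\n"
    b"</root>\n"
)

# Option 1: discard ignorable whitespace while parsing.
parser = etree.XMLParser(remove_blank_text=True)
root = etree.fromstring(pretty, parser)
root.remove(root[2])
result_parser = etree.tostring(root, pretty_print=True).decode("ascii")
print(result_parser)


def strip_blank_text(element):
    """Hypothetical helper: clear whitespace-only text and tails in place."""
    # iter("*") visits elements only, skipping comments and processing instructions.
    for node in element.iter("*"):
        if node.text is not None and not node.text.strip():
            node.text = None
        if node.tail is not None and not node.tail.strip():
            node.tail = None


# Option 2: the tree was parsed without the option, so clean it afterwards.
root2 = etree.fromstring(pretty)
root2.remove(root2[2])
strip_blank_text(root2)
result_manual = etree.tostring(root2, pretty_print=True).decode("ascii")
print(result_manual)
```

Both approaches discard whitespace-only text, so they are only appropriate when that whitespace carries no meaning in the document.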

See also the related question on Stack Overflow: https://stackoverflow.com/questions/16565966/