python - lxml removes unwrapped text inside tags

Reposted. Author: 行者123. Updated: 2023-11-30 22:52:16

Here is my Python code using lxml:

from lxml import etree

some_xml_data = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"
root = etree.fromstring(some_xml_data)
[c] = root.xpath('//span')
print(etree.tostring(root))  # b'<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>', as expected

# But after removing the div children:
for e in c.iterchildren("*"):
    if e.tag == 'div':
        e.getparent().remove(e)

print(etree.tostring(root))  # b'<span>text1</span>': text2 and text3 are gone! How can I prevent this?

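It looks like after I change the lxml tree (by removing some tags), lxml also removes some of the unwrapped text! How can I stop lxml from doing this and keep the unwrapped text?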

Best answer

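The text that follows a node is called its tail, and lxml stores it on that node, so it disappears when the node is removed. You can keep it by appending it to the parent's text before removing the node. Here is an example: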

In [1]: from lxml import html

In [2]: s = "<span>text1<div>ddd</div>text2<div>ddd</div>text3</span>"

In [3]: tree = html.fromstring(s)

In [4]: for node in tree.iterchildren("div"):
   ...:     parent = node.getparent()
   ...:     if node.tail:
   ...:         # parent.text may be None, so fall back to an empty string
   ...:         parent.text = (parent.text or "") + node.tail
   ...:     parent.remove(node)
   ...:

In [5]: html.tostring(tree)
Out[5]: b'<span>text1text2text3</span>'

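I used html because the structure is more likely to be HTML than XML. You can also pass the tag name straight to iterchildren, as in iterchildren("div"), to avoid an extra check on the tag.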
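
Appending the tail to the parent's text works when every child is removed, as in the example above. If some siblings are kept, the tail probably belongs after the preceding sibling instead. The following is a minimal sketch of that more general case using etree; the helper name remove_keep_tail and the sample markup with a <b> element are assumptions made for illustration and are not part of lxml or of the original question.

```python
from lxml import etree


def remove_keep_tail(node):
    """Remove node from its parent but keep the text that follows it.

    remove_keep_tail is a hypothetical helper name, not part of lxml.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot remove the root element")
    if node.tail:
        previous = node.getprevious()
        if previous is not None:
            # A sibling precedes the node: the text belongs after that sibling.
            previous.tail = (previous.tail or "") + node.tail
        else:
            # The node is the first child: the text belongs to the parent.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)


# Sample markup (assumed for illustration): one child is kept, two are removed.
root = etree.fromstring(
    "<span>text1<div>ddd</div>text2<b>keep</b>text3<div>ddd</div>text4</span>"
)
# Materialize the list first so the tree is not modified while iterating.
for div in list(root.iterchildren("div")):
    remove_keep_tail(div)

result = etree.tostring(root)
print(result)
```

For trees parsed with lxml.html, elements also offer a drop_tree() method, which removes the element and joins its tail to the previous element or the parent, so a hand-written helper may not be needed there.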

Regarding "python - lxml removes unwrapped text inside tags", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38661087/
