
python - Combining multiple tags with lxml

Reposted. Author: 太空宇宙. Updated: 2023-11-04 15:03:39

I have an HTML file that looks like this:

...
<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>
...

What I need: if all the tags inside a "p" block are "strong" tags, combine them into a single one, i.e.

<p>
<strong>This is a line which I want to join.</strong>
</p>

The other block should be left untouched, because it contains other content.

Any suggestions? I am using lxml.

Update:

This is what I have tried so far:

for p in self.tree.xpath('//body/p'):
    if p.text is None:  # no text before the first element
        children = list(p)
        for child in children:
            # stop at a lone child, a non-strong tag, or text after a tag
            if len(children) == 1 or child.tag != 'strong' or child.tail is not None:
                break
        else:
            # only reached when the loop did not break
            etree.strip_tags(p, 'strong')

With this code I was able to strip the strong tags from the part I wanted, which gives:

<p>
This is a line which I want to join.
</p>

So now I just need a way to put the tag back...

Best answer

I was able to do this with bs4 (BeautifulSoup):

from bs4 import BeautifulSoup as bs

html = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p>"""

# name the parser explicitly; 'lxml' wraps the fragment in <html><body>
soup = bs(html, 'lxml')
# note that I use the 0th <p> block ...[0],
# so make the appropriate change in your code
# take the text of the block and drop the newlines between the tags
s = soup.find_all('p')[0].text.replace('\n', '')
s = '<p><strong>' + s + '</strong></p>'
print(s)  # prints: <p><strong>This is a line which I want to join.</strong></p>

Then use replace_with():

p_tag = soup.p  # the first <p> in the document
p_tag.replace_with(bs(s, 'html.parser'))
print(soup)

This prints:

<html><body><p><strong>This is a line which I want to join.</strong></p>
<p>
<strong>But do not </strong>
<strong>touch this</strong>
</p></body></html>
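The snippet above handles only the first "p" block and does not check what the block contains. A minimal sketch of how the same idea could be applied to every block, while skipping blocks that hold other tags or loose text, might look like this. The function name `merge_strong_blocks` and the `SAMPLE` string are hypothetical, and treating whitespace-only text between tags as ignorable is an assumption of the sketch.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical sample, modelled on the HTML in the question
SAMPLE = """<p>
<strong>This is </strong>
<strong>a lin</strong>
<strong>e which I want to </strong>
<strong>join.</strong>
</p>
<p>
2.
<strong>But do not </strong>
<strong>touch this</strong>
<em>Maybe some other tags as well.</em>
bla bla blah...
</p>"""


def merge_strong_blocks(soup):
    """Merge the <strong> tags of each <p> that contains nothing else."""
    merged_count = 0
    for p in soup.find_all('p'):
        tags = [c for c in p.contents if isinstance(c, Tag)]
        texts = [c for c in p.contents if isinstance(c, NavigableString)]
        # Need at least two child tags, and all of them must be <strong>
        if len(tags) < 2 or any(t.name != 'strong' for t in tags):
            continue
        # Loose text directly inside <p> means the block is left alone;
        # whitespace-only strings are ignored (an assumption of this sketch)
        if any(t.strip() for t in texts):
            continue
        merged = soup.new_tag('strong')
        merged.string = ''.join(t.get_text() for t in tags)
        p.clear()          # drop the old children and the whitespace
        p.append(merged)
        merged_count += 1
    return merged_count


soup = BeautifulSoup(SAMPLE, 'html.parser')
count = merge_strong_blocks(soup)
print(soup.find_all('p')[0])
```

Building the new tag with `new_tag` and assigning `.string` avoids concatenating HTML by hand, so characters such as `<` or `&` in the text are escaped on output.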

Regarding python - Combining multiple tags with lxml, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/30836928/
