
python - A more efficient way to remove elements whose child element contains "English"

Reposted. Author: 行者123. Updated: 2023-12-03 18:49:23

I have an HTML document like this:

<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>

<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>

<details data-level="2" open="">
<summary class="section-heading"><h2 id="French">French</h2></summary>
<details data-level="3" open="">abc</details>
</details>

<details data-level="2" open="">
<summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
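Inside each <details data-level="2" open=""> element there is an element such as <h2 id="English">English</h2> that indicates the language. My goal is to remove every <details data-level="2" open=""> whose language is not English. My expected result is: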
<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>

<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
I get this result with the following code:
from bs4 import BeautifulSoup

texte = """
<div id="bodyContent" class="content mw-parser-output">
<div id="mw-content-text" style="direction: ltr;">
<h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
<span class="mw-headline" id="title_0">pomme</span>
</h1>

<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
</div>
"""

soup = BeautifulSoup(texte, 'html.parser')
tmp = soup.select('details > summary > h2')
tmp2 = [s.contents[0] for s in tmp]

# Remove every enclosing <details> whose heading text is not 'English'
for i in range(len(tmp2)):
    if tmp2[i] != 'English':
        tmp[i].find_parent('details').decompose()

soup
I need to repeat this operation nearly 4 million times. Is there a more efficient way to do it? Thank you very much for your help!

Best answer

You can use a CSS selector with :not() and then .extract() the selected elements:

# Select level-2 <details> that do NOT contain <h2 id="English"> and remove them
for d in soup.select('details[data-level="2"]:not(:has(h2#English))'):
    d.extract()

print(soup.prettify())
Prints:
<div class="content mw-parser-output" id="bodyContent">
 <div id="mw-content-text" style="direction: ltr;">
  <h1 aria-haspopup="true" class="section-heading" data-section-id="0" tabindex="0">
   <span class="mw-headline" id="title_0">
    pomme
   </span>
  </h1>
  <details data-level="2" open="">
   <summary class="section-heading">
    <h2 id="English">
     English
    </h2>
   </summary>
   <details data-level="3" open="">
    abc
   </details>
  </details>
 </div>
</div>
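
Since the question mentions repeating this for nearly 4 million documents, it may help to wrap the selector approach in a small function. The following is a minimal sketch and not part of the original answer: the function name `keep_only_language`, its `language` parameter, and the `sample` string are hypothetical. It uses the attribute selector `h2[id="..."]` instead of `h2#English` so that the language name does not have to be a valid CSS identifier, and it assumes `.decompose()` is acceptable in place of `.extract()` because the removed sections are not needed afterwards.

```python
from bs4 import BeautifulSoup


def keep_only_language(html, language="English"):
    """Parse html and drop every level-2 <details> not headed by `language`.

    Hypothetical helper for illustration; assumes the page layout shown
    in the question (an <h2 id="..."> inside each level-2 <details>).
    """
    soup = BeautifulSoup(html, "html.parser")
    # Level-2 <details> that have no descendant <h2> with the wanted id
    selector = f'details[data-level="2"]:not(:has(h2[id="{language}"]))'
    for section in soup.select(selector):
        section.decompose()  # remove the section and discard it
    return soup


# Shortened, made-up sample with the same structure as the question's HTML
sample = """
<div id="mw-content-text">
<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
<summary class="section-heading"><h2 id="French">French</h2></summary>
<details data-level="3" open="">abc</details>
</details>
</div>
"""

result = keep_only_language(sample)
remaining = [h2.get_text() for h2 in result.select('details[data-level="2"] h2')]
print(remaining)  # ['English']
```

Whether this is fast enough for millions of pages would need to be measured on real data. Parsing is often a large share of the runtime, so trying a different parser backend such as `lxml` (if it is installed) is one thing worth testing.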

Regarding "python - A more efficient way to remove elements whose child element contains "English"", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67204796/
