
python-3.x - Extract text data from a div tag, but not from its child h3 tag

Reposted. Author: 行者123. Updated: 2023-12-05 00:11:08

I have an HTML snippet from which I need to extract data using BeautifulSoup:

<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>

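I need to get the .text value of the second div from the top, "John Doe", but not the .text value of the h3 tag inside that div.
My problem is that the code below currently returns both text values: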
# Python 3.7, BeautifulSoup 4.7
# html variable is equal to the above HTML snippet
from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()
print(school_head_teacher)

This outputs:
Teacher
John Doe

However, I only need the John Doe value.

Best answer

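Here are two solutions. The first is not the most elegant, but it was the quickest thing that came to mind: split the text after "Teacher" and join the remaining pieces back together.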

Option 1:

html = '''
<!doctype html>
<html lang="en">
<body>
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
</body>
</html>'''



from bs4 import BeautifulSoup
soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
school_head_teacher = soup4.find_all('div', {'class':'sidebar-box'})
school_head_teacher = school_head_teacher[1].text.strip()

school_head_teacher = school_head_teacher.split()[1:]
school_head_teacher = ' '.join(school_head_teacher)

print(school_head_teacher)

Output:
print(school_head_teacher)
John Doe
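
Note that `split()[1:]` drops only the first whitespace-separated word, so it works for the one-word heading "Teacher" but would leave "Team" behind for "Management Team". As a hedged alternative that is not part of the original answer, the sketch below collects only the text nodes that are direct children of the div, so the heading length does not matter. The helper name `own_text` is hypothetical, and `string=` is the newer spelling of the `text=` argument used elsewhere on this page.

```python
from bs4 import BeautifulSoup

html = """
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
"""


def own_text(tag):
    """Return the text directly inside `tag`, ignoring text in child tags.

    Hypothetical helper for illustration only.
    """
    # recursive=False restricts the search to direct children of `tag`,
    # so the string inside the nested <h3> is never visited.
    pieces = tag.find_all(string=True, recursive=False)
    return " ".join(p.strip() for p in pieces if p.strip())


soup = BeautifulSoup(html, "html.parser")
boxes = soup.find_all("div", class_="sidebar-box")

print(own_text(boxes[1]))  # John Doe
print(own_text(boxes[0]))  # Chairman, Director
```
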

Option 2:

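I think this one is a bit better. Find the text node that contains "Teacher", then get its parent tag (the h3). Because you want the part that comes after it, use .next_sibling and strip the result. This needs import re: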
soup4(text=re.compile('Teacher'))[0].parent.next_sibling.strip()
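
To make the chain easier to follow, here is a minimal sketch that performs the same lookup one step at a time on a reduced copy of the HTML. The variable names `label`, `heading`, and `value` are made up for illustration, and `soup.find(string=...)` is used instead of indexing the result list; it is assumed here to behave like the `text=` form in the one-liner.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: find the first text node whose content matches "Teacher".
label = soup.find(string=re.compile("Teacher"))

# Step 2: the parent of that text node is the <h3> wrapping the label.
heading = label.parent

# Step 3: the node right after the <h3> is the bare text inside the <div>.
value = heading.next_sibling.strip()

print(heading.name)  # h3
print(value)         # John Doe
```
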

I put it in a for loop in case there is more than one teacher, but you can use the one-liner above in place of the for loop:
from bs4 import BeautifulSoup
import re

soup4 = BeautifulSoup(html, "html.parser")
# Get School Head Teacher
for elem in soup4(text=re.compile('Teacher')):
    print(elem.parent.next_sibling.strip())
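
If every sidebar box is needed rather than just the teacher, the same `.next_sibling` idea can be applied to each h3 in turn. The sketch below is an extension beyond the original answer: the function name `sidebar_values` is hypothetical, and it assumes each box holds one h3 followed directly by a text node, as in the question's HTML.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="sidebar-box">
<h3><i class="fa fa-users"></i> Management Team</h3>
Chairman, Director
</div>
<div class="sidebar-box">
<h3><i class="fa fa-male"></i> Teacher</h3>
John Doe
</div>
<div class="sidebar-box">
<h3><i class="fa fa-mortar-board"></i> Awards </h3>
National Top Quality Educational Development
</div>
<div class="sidebar-box">
<h3><i class="fa fa-building"></i> School Type</h3>
Secondary
</div>
"""


def sidebar_values(markup):
    """Map each sidebar heading to the text that follows it (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    result = {}
    for box in soup.find_all("div", class_="sidebar-box"):
        heading = box.find("h3")
        if heading is None:
            continue  # skip boxes without a heading
        sibling = heading.next_sibling
        # Only accept a plain text node; a tag or None is skipped.
        if isinstance(sibling, NavigableString):
            result[heading.get_text(strip=True)] = sibling.strip()
    return result


values = sidebar_values(html)
print(values["Teacher"])  # John Doe
```

Looking values up by heading text avoids depending on the position of the div, which the `[1]` index in the question's code does.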

Regarding "python-3.x - Extract text data from a div tag, but not from its child h3 tag", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54707259/
