
python - How to parse the HTML tag hierarchy in Python?

Reposted. Author: 太空宇宙. Updated: 2023-11-03 21:11:48

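I have an HTML page from which I am extracting all the headings (h1 to h7) with Beautiful Soup. Now I want a list in which each heading is prefixed with the text of the headings at the levels directly above it.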

For example, I have this sample HTML page:

<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>
<h1>dummy h1</h1>
<h1>head 1</h1>
<p>para 1</p>
<h2>head 2</h2>
<p>para 2</p>
<h3>head 3</h3>
<p>p for head3</p>
<h2>head2(2)</h2>
<p>para3</p>
<h1>head1(2)</h1>
<h2>2nd h2</h2>
<h3>2nd h3</h3>
<p>2nd p for h3</p>
</body>
</html>

The list I want should look like this:

['head1','head1 head2','head1 head2 head3','head1 head2(2)','head1(2)','head1(2) 2nd h2','head1(2) 2nd h2 2nd h3']

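My logic is to walk backwards from the current h tag and break out of the loop as soon as a smaller (deeper-level) h tag is encountered. This causes a problem: when walking back from head2(2), the loop breaks at head3, whereas ideally it should continue up to head1. Here is the code I tried: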

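from copy import deepcopy

from bs4 import BeautifulSoup

with open("sample.html", "r") as file:
    page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy

head=[]  # one output string per heading
h=[]     # headings seen so far, in document order
h3=[]    # reversed copy of h, walked backwards from the current heading

for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):
                # prepend every earlier heading of a higher level
                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]

                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue

                # problem: breaks as soon as the next earlier heading is deeper,
                # e.g. at head3 when walking back from head2(2)
                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)):
                    break

            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)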

Please suggest what I should do to avoid this problem.

Best answer

I found what is wrong with your algorithm. You only need to add a test on the value of q.name to the break condition:

if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
    break

So the complete code is:

from copy import deepcopy

from bs4 import BeautifulSoup

with open("sample.html", "r") as file:
    page = file.read()
soup = BeautifulSoup(page, 'html.parser')
tags=['h1','h2','h3','h4','h5','h6','h7']
start=soup.find('h1') # the page I am working on starts with a dummy

head=[]  # one output string per heading
h=[]     # headings seen so far, in document order
h3=[]    # reversed copy of h, walked backwards from the current heading

for ele in start.next_siblings:
    for i,tag in enumerate(tags):
        if (ele.name==tag):
            head.append('')
            h.append(ele)
            h3=deepcopy(h)
            h3.reverse()
            for j, q in enumerate(h3):

                # prepend every earlier heading of a higher level
                if q.name in tags[:i]:
                    head[len(head)-1]=(q.text.strip()) + ' ' + head[len(head)-1]

                if j < len(h)-1 and (tags.index(q.name) == tags.index(h3[j+1].name)):
                    continue

                # stop only once the enclosing h1 has been reached
                if j < len(h)-1 and (tags.index(q.name) < tags.index(h3[j+1].name)) and q.name == 'h1':
                    break

            head[len(head)-1]+=(ele.text.strip())+' '
            break
print(head)

Output:

['head 1 ', 'head 1 head 2 ', 'head 1 head 2 head 3 ', 'head 1 head2(2) ', 'head1(2) ', 'head1(2) 2nd h2 ', 'head1(2) 2nd h2 2nd h3 ']
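
The output above matches the sample page. Because the break condition is tied to h1, though, other layouts may behave differently: tracing the loop by hand, an h3 that follows two sibling h2 headings appears to pick up the text of both. One alternative, which is not the code from this answer, is to keep a stack of the currently open headings. The sketch below assumes that approach; `heading_paths`, `heading_paths_from_html`, and `HEADING_TAGS` are hypothetical names introduced only for illustration, and the wrapper assumes the page starts with a dummy h1 as in the question.

```python
# HTML defines h1 through h6, so h7 is left out of this sketch.
HEADING_TAGS = ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']


def heading_paths(headings):
    """Build one 'ancestors + heading' string per (tag_name, text) pair.

    `headings` must be in document order.
    """
    stack = []   # (level, text) of the headings that are currently open
    paths = []
    for name, text in headings:
        level = HEADING_TAGS.index(name)
        # A new heading closes every open heading at the same or a deeper level.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, text))
        paths.append(' '.join(t for _, t in stack))
    return paths


def heading_paths_from_html(html, skip_first=True):
    """Hypothetical wrapper: collect the headings with Beautiful Soup first."""
    # Imported here so heading_paths itself has no third-party dependency.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, 'html.parser')
    tags = soup.find_all(HEADING_TAGS)   # all headings, in document order
    if skip_first:
        tags = tags[1:]                  # assumption: the first h1 is a dummy
    return heading_paths((t.name, t.get_text(strip=True)) for t in tags)
```

Calling `heading_paths_from_html(page)` on the sample page should give the same seven entries as the output above, without the trailing spaces. Since each heading is pushed and popped at most once, the work stays linear in the number of headings.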

Let me know if this helps :-)

Regarding "python - How to parse the HTML tag hierarchy in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55015995/
