
python - BeautifulSoup: getting a range of divs

Reposted. Author: 行者123. Updated: 2023-12-01 00:05:27

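I have just learned how to process web pages in Python with BeautifulSoup. There is a list of divs, and I want to get the content within a specific range of that list. The range is delimited by two divs that each have an h2 child. How can I do this? Thanks for your support!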

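Edit: I have added an actual representation of the HTML below, replacing the earlier "simplified" version that was missing tags. The new code shows a root div with the class foo-bar-details. Nested inside it are 9 div tags. Two of them have a nested h2 tag. All 9 of these div tags contain deeply nested img elements. What I need is every img element from the divs that lie between the two divs containing an h2 element. Applied to the HTML below, the expected result would be: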

<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">

Here is the HTML:

<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-39826.html"><img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 "></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary2.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-223234.html"><img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 "></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary1.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-223823.html"><img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 "></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-4-1">Foo feature</a> </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-123456.html"><img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456"></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><a href="../test.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<a href="../link.pdf" class="my-link" title="title"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-67890.html"><img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890"></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-5-1">Bar feature</a> </li>
...
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-39826.html"><img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 "></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-209876.html"><img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 "></a>
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>

Best answer

Here is a solution that uses lxml.html:

We extract every div tag from the first one containing an h2 through the last one containing an h2:

import lxml.html


# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
    tree = lxml.html.fromstring(f.read())

# all_div = tree.findall('div')
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
    if div.findall('h2') and start is None:
        print("Range starts at %d" % k)
        start = k
        continue
    if div.findall('h2') and start is not None:
        print("Range stops at %d" % k)
        stop = k + 1  # add one as range stops at k - 1
        continue

# div_list = all_div[start:stop]
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]

# Or
img_list = [_.xpath('.//img/@src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
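
Since the question asks about BeautifulSoup, the same idea can be sketched with bs4. This is only an illustration, not part of the original answer. The function name images_between_headers and the shortened SAMPLE_HTML are made up for the example, and it assumes the h2 is a direct child of each marker div, as in the question's markup. Unlike the lists above, it skips the two marker divs and returns one flat list of img tags, which matches the expected result in the question.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (hypothetical sample data).
SAMPLE_HTML = """
<div class="foo-bar-details">
  <div id="a"><div class="row"><img src="before.jpg"></div></div>
  <div id="elem-4"><h2>Foo</h2></div>
  <div id="b"><div class="row"><a href="#"><img src="123456_thumb.jpg"></a></div></div>
  <div id="c"><div class="row"><a href="#"><img src="67890_thumb.JPG"></a></div></div>
  <div id="elem-5"><h2>Bar</h2></div>
  <div id="d"><div class="row"><img src="after.jpg"></div></div>
</div>
"""


def images_between_headers(html):
    """Return the img tags in the direct child divs lying between the
    first and last child div that has an h2 as a direct child."""
    soup = BeautifulSoup(html, "html.parser")
    root = soup.find("div", class_="foo-bar-details")
    if root is None:
        return []
    # Direct children only, so nested divs are not treated as siblings.
    children = root.find_all("div", recursive=False)
    # Indices of the child divs that directly contain an h2.
    marks = [i for i, div in enumerate(children)
             if div.find("h2", recursive=False) is not None]
    if len(marks) < 2:
        return []
    start, stop = marks[0], marks[-1]
    # Flatten: one list of img tags instead of one list per div.
    return [img
            for div in children[start + 1:stop]
            for img in div.find_all("img")]


images = images_between_headers(SAMPLE_HTML)
print([img["src"] for img in images])
```

To run it against the saved page instead of the sample, read "file.html" and pass its contents to images_between_headers.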

Regarding "python - BeautifulSoup: getting a range of divs", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60016647/
