python - BeautifulSoup web scraping find_all(): finding exact match

Reposted by: 技术小花猫 | Updated: 2023-10-29 12:22:29

I am using Python and BeautifulSoup for web scraping.

Say I have the following HTML to scrape:

<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>

Using BeautifulSoup, I want to find only the products whose attribute is exactly class="product" (Products 1 and 2 only), not the "special" products.

If I do the following:

result = soup.find_all('div', {'class': 'product'})

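The result includes all of the products (1, 2, 3, and 4).

What should I do to find only the products whose class matches "product" exactly?

The code I ran: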

from bs4 import BeautifulSoup
import re

text = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""

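soup = BeautifulSoup(text, "html.parser")  # name the parser explicitly
# The regex is tested against each class value separately, so "product special" still matches
result = soup.find_all(attrs={'class': re.compile(r"^product$")})
print(result)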

Output:

[<div class="product">Product 1</div>, <div class="product">Product 2</div>, <div class="product special">Product 3</div>, <div class="product special">Product 4</div>]

Best answer

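In BeautifulSoup 4, the class attribute (and several other attributes, such as accesskey and the headers attribute on table cell elements) is treated as a set of values; a search matches against the individual values listed in the attribute. This follows the HTML standard.

As a result, a plain attribute search cannot be limited to elements that carry only that one class.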
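To see why the plain search matches all four elements, it can help to inspect how the class attribute is stored. The following is a minimal sketch, assuming the standard library parser "html.parser"; the variable names `html` and `class_lists` are illustrative and not part of the original question.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="product">Product 1</div>'
    '<div class="product special">Product 3</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Each class attribute is exposed as a list of values, not as one string.
class_lists = [div["class"] for div in soup.find_all("div")]
print(class_lists)
```

Because "product" is one of the values in both lists, a search for the class "product" matches both elements.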

Instead, you have to use a custom function to match against the class:

result = soup.find_all(lambda tag: tag.name == 'div' and 
                                   tag.get('class') == ['product'])

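I used a lambda to create an anonymous function; each tag is matched on its name (it must be 'div'), and its class attribute must be exactly equal to the list ['product'], that is, it must have just that one value.

Demo: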

>>> from bs4 import BeautifulSoup
>>> text = """
... <body>
...     <div class="product">Product 1</div>
...     <div class="product">Product 2</div>
...     <div class="product special">Product 3</div>
...     <div class="product special">Product 4</div>
... </body>"""
>>> soup = BeautifulSoup(text, "html.parser")
>>> soup.find_all(lambda tag: tag.name == 'div' and tag.get('class') == ['product'])
[<div class="product">Product 1</div>, <div class="product">Product 2</div>]
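As a possible alternative to the lambda, a CSS attribute selector compares the whole attribute value rather than its individual class values. This is only a sketch, and it assumes a Beautiful Soup version whose select() is backed by the Soup Sieve package (installed as a dependency of recent bs4 releases); the names `html` and `exact` are illustrative.

```python
from bs4 import BeautifulSoup

html = """
<body>
    <div class="product">Product 1</div>
    <div class="product">Product 2</div>
    <div class="product special">Product 3</div>
    <div class="product special">Product 4</div>
</body>"""
soup = BeautifulSoup(html, "html.parser")

# [class="product"] tests the full attribute string, so "product special" does not match.
exact = soup.select('div[class="product"]')
names = [div.get_text() for div in exact]
print(names)
```

Note that the selector div.product would behave like the original find_all() call and match all four elements, since it tests for the presence of a single class.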

For completeness, here are all of the attributes treated as sets this way, taken from the BeautifulSoup source code:

# The HTML standard defines these attributes as containing a
# space-separated list of values, not a single value. That is,
# class="foo bar" means that the 'class' attribute has two values,
# 'foo' and 'bar', not the single value 'foo bar'.  When we
# encounter one of these attributes, we will parse its value into
# a list of values if possible. Upon output, the list will be
# converted back into a string.
cdata_list_attributes = {
    "*" : ['class', 'accesskey', 'dropzone'],
    "a" : ['rel', 'rev'],
    "link" :  ['rel', 'rev'],
    "td" : ["headers"],
    "th" : ["headers"],
    "td" : ["headers"],
    "form" : ["accept-charset"],
    "object" : ["archive"],

    # These are HTML5 specific, as are *.accesskey and *.dropzone above.
    "area" : ["rel"],
    "icon" : ["sizes"],
    "iframe" : ["sandbox"],
    "output" : ["for"],
    }
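If list handling of these attributes is not wanted at all, it can be switched off when the document is parsed. The sketch below assumes Beautiful Soup 4.8 or later, where the constructor accepts a multi_valued_attributes keyword argument; with None, class stays a plain string and can be compared directly. The names `html`, `values`, and `exact` are illustrative.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="product">Product 1</div>'
    '<div class="product special">Product 3</div>'
)

# Assumption: bs4 >= 4.8, which supports multi_valued_attributes.
soup = BeautifulSoup(html, "html.parser", multi_valued_attributes=None)

# class is now kept as the raw attribute string.
values = [div["class"] for div in soup.find_all("div")]

# Compare against the whole string to get an exact match.
exact = soup.find_all(lambda tag: tag.name == "div" and tag.get("class") == "product")
print(values, exact)
```

This changes how every multi-valued attribute in the document is parsed, so the lambda approach is the less invasive choice when only one search needs an exact match.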

This question and answer, python - BeautifulSoup web scraping find_all(): finding exact match, come from a similar question found on Stack Overflow: https://stackoverflow.com/questions/22726860/
