python - Using BeautifulSoup to extract all element tags except specified ones


Reposted by 太空宇宙, updated 2023-11-04 05:26:22


I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:

<div class="the-one-i-want">
    <p>
        content
    </p>
    <p>
        content
    </p>
    <p>
        content
    </p>
    <p>
        content
    </p>
    <ol>
        <li>
            list item
        </li>
        <li>
            list item
        </li>
    </ol>
    <div class="something-i-dont-want">
        content
    </div>
    <script class="something-else-i-dont-want">
        script
    </script>
    <p>
        content
    </p>
</div>

All of the content that I want to extract is inside the <div class="the-one-i-want"> element. Right now, I'm using the following approach, which works most of the time:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.text, 'lxml')  # html is the fetched response object
content = soup.find('div', class_='the-one-i-want').find_all('p')

This excludes scripts, oddly inserted divs, and other unpredictable content such as ads or "recommended content" blocks.

Now, in some cases there are elements other than <p> tags that hold content contextually important to the main content, such as lists.

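Is there a way to get the content from the <div class="the-one-i-want"> in a manner like this:

soup = BeautifulSoup(html.text, 'lxml')
# desired_content_elements is a placeholder for whatever selects the tags I want
content = soup.find('div', class_='the-one-i-want').find_all(desired_content_elements)

where desired_content_elements would include every element I consider appropriate for that particular content? For example, all <p> tags and all <ol> and <li> tags, but no <div> or <script> tags.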

Perhaps worth noting is how I save the content:

content_string = ''
for p in content:
    content_string += str(p)

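This approach collects the data in order of occurrence, which would be hard to maintain if I found each element type in a separate pass. If possible, I'd like to avoid having to rebuild split lists to restore the order in which each element originally appeared in the content.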

Best answer

You can pass a list of the tags you want:

# "whatever" stands in for any other tag name you want to include
content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
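As a minimal, self-contained sketch of that idea: the markup below is a hypothetical sample modeled on the question's HTML, and `extract_wanted` is a helper name invented here for illustration. It uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency. Note that `find_all` searches recursively by default, so a `<p>` nested inside an unwanted `<div>` would also match, and asking for both `ol` and `li` would return each list item twice (once inside its `<ol>`, once on its own). Passing `recursive=False` restricts the search to direct children, which is one way to avoid both problems if the wanted tags are direct children of the container.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's HTML.
HTML = """
<div class="the-one-i-want">
    <p>first paragraph</p>
    <ol>
        <li>list item</li>
    </ol>
    <div class="something-i-dont-want"><p>ad text</p></div>
    <script>var x = 1;</script>
    <p>last paragraph</p>
</div>
"""


def extract_wanted(html, wanted=("p", "ol")):
    """Return the wanted direct children of the target div, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="the-one-i-want")
    # recursive=False limits the search to direct children, so the <p> nested
    # inside the unwanted <div> is skipped and <li> items are not returned twice.
    return container.find_all(list(wanted), recursive=False)


tags = extract_wanted(HTML)
# Same joining step as in the question; order of occurrence is preserved.
content_string = "".join(str(tag) for tag in tags)
print([tag.name for tag in tags])
```

Because the result comes from a single `find_all` call, the matched tags keep the order in which they occur in the document, so there are no split lists to merge afterwards.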

If we run something similar against the URL of your question, looking for p and pre tags, you can see that we get both:

for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
    print(ele)

<p>I'm using Beutifulsoup 4 and Python 3.5+ to extract webdata. I have the following html, from which I am extracting:</p>
<pre><code>&lt;div class="the-one-i-want"&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;ol&gt;
        &lt;li&gt;
            list item
        &lt;/li&gt;
        &lt;li&gt;
            list item
        &lt;/li&gt;
    &lt;/ol&gt;
    &lt;div class='something-i-don't-want&gt;
        content
    &lt;/div&gt;
    &lt;script class="something-else-i-dont-want'&gt;
        script
    &lt;/script&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
&lt;/div&gt;
</code></pre>
<p>All of the content that I want to extract is found within the <code>&lt;div class="the-one-i-want"&gt;</code> element. Right now, I'm using the following methods, which work most of the time:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
</code></pre>
<p>This excludes scripts, weird insert <code>div</code>'s and otherwise un-predictable content such as ads or 'recommended content' type stuff.</p>
<p>Now, there are some instances in which there are elements other than just the <code>&lt;p&gt;</code> tags, which has content that is contextually important to the main content, such as lists.</p>
<p>Is there a way to get the content from the <code>&lt;div class="the-one-i-want"&gt;</code> in a manner as such:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
</code></pre>
<p>Where <code>desired-content-elements</code>would be inclusive of every element that I deemed fit for that particular content? Such as, all <code>&lt;p&gt;</code> tags, all <code>&lt;ol&gt;</code> and <code>&lt;li&gt;</code> tags, but no <code>&lt;div&gt;</code> or <code>&lt;script&gt;</code> tags.</p>
<p>Perhaps noteworthy, is my method of saving the content:</p>
<pre><code>content_string = ''
for p in content:
    content_string += str(p)
</code></pre>
<p>This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.</p>
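The question's title asks for everything except specified elements, so the inverse approach may also be worth sketching. This is not part of the original answer: it is an alternative under the assumption that the unwanted tags are direct children of the container. The sample markup and the helper name `strip_unwanted` are hypothetical. It removes the unwanted tags with `decompose()` and keeps whatever is left.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modeled on the question's HTML.
HTML = """
<div class="the-one-i-want">
    <p>first paragraph</p>
    <ol>
        <li>list item</li>
    </ol>
    <div class="something-i-dont-want"><p>ad text</p></div>
    <script>var x = 1;</script>
    <p>last paragraph</p>
</div>
"""


def strip_unwanted(html, unwanted=("div", "script")):
    """Remove unwanted direct children from the target div and return the div."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="the-one-i-want")
    # find_all returns a complete list before anything is removed, and
    # recursive=False keeps the matches to direct children only, so no
    # matched tag is nested inside another tag that gets destroyed first.
    for tag in container.find_all(list(unwanted), recursive=False):
        tag.decompose()  # removes the tag and its contents from the tree
    return container


cleaned = strip_unwanted(HTML)
# find_all(True) matches every tag; here, the remaining direct children.
remaining = cleaned.find_all(True, recursive=False)
print([tag.name for tag in remaining])
```

The trade-off is that a blocklist lets through any new, unexpected tag type, whereas the allowlist passed to `find_all` in the answer only ever returns the tags that were named.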

For "python - Using BeautifulSoup to extract all element tags except specified ones", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38507514/
