
python - With BeautifulSoup, extract tags of elements except for those specified

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:26:22

I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:

<div class="the-one-i-want">
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<p>
content
</p>
<ol>
<li>
list item
</li>
<li>
list item
</li>
</ol>
<div class="something-i-dont-want">
content
</div>
<script class="something-else-i-dont-want">
script
</script>
<p>
content
</p>
</div>

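All of the content that I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following approach, which works most of the time: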

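from bs4 import BeautifulSoup
# html is assumed to be a response object with a .text attribute (e.g. from requests)
soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').find_all('p')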

This excludes scripts, odd inserted divs, and otherwise unpredictable content such as ads or "recommended content" blocks.

Now, there are some cases in which elements other than <p> tags hold content that is contextually important to the main content, such as lists.

Is there a way to get the content from <div class="the-one-i-want"> in a manner like this:

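soup = BeautifulSoup(html.text, 'lxml')
# desired_content_elements is a placeholder for whatever describes the elements to keep
content = soup.find('div', class_='the-one-i-want').find_all(desired_content_elements)

Here desired_content_elements would include every element that I consider appropriate for that particular content. For example: all <p> tags and all <ol> and <li> tags, but no <div> or <script> tags.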

Perhaps noteworthy is my method of saving the content:

content_string = ''
for p in content:
    content_string += str(p)

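This approach collects the data in order of occurrence, which would be difficult to manage if I found the different element types through separate iterations. If possible, I'd like to avoid rebuilding split lists to restore the order in which each element originally appeared in the content.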

Best answer

You can pass a list of the tags you want:

content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])
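As a minimal, self-contained sketch of the list form above: the sample markup, its class names, and the use of the built-in html.parser (instead of the 'lxml' parser from the question) are assumptions made for illustration. Passing recursive=False is an optional extra not mentioned in the answer; it restricts the search to direct children, so a <p> nested inside an unwanted <div> is not picked up.

```python
from bs4 import BeautifulSoup

# Sample markup modeled on the question; class names are illustrative.
html_doc = """
<div class="the-one-i-want">
<p>first</p>
<ol><li>item one</li><li>item two</li></ol>
<div class="something-i-dont-want"><p>ad</p></div>
<script>var x = 1;</script>
<p>last</p>
</div>
"""

# html.parser ships with Python; the question uses 'lxml' instead.
soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# A list of names matches any of them, and results come back in
# document order. recursive=False looks at direct children only,
# so the <p> inside the unwanted <div> is skipped.
wanted = container.find_all(["p", "ol"], recursive=False)

# Same accumulation as in the question, written as a join.
content_string = "".join(str(tag) for tag in wanted)
tag_names = [tag.name for tag in wanted]
print(tag_names)
```

Because the <ol> is serialized together with its <li> children, "li" does not need to be in the list; adding it to a recursive search would return each list item a second time.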

If we run something similar on the URL of your question, looking for p and pre tags, you can see that we get both:

for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
    print(ele)

<p>I'm using Beutifulsoup 4 and Python 3.5+ to extract webdata. I have the following html, from which I am extracting:</p>
<pre><code>&lt;div class="the-one-i-want"&gt;
&lt;p&gt;
content
&lt;/p&gt;
&lt;p&gt;
content
&lt;/p&gt;
&lt;p&gt;
content
&lt;/p&gt;
&lt;p&gt;
content
&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;
list item
&lt;/li&gt;
&lt;li&gt;
list item
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class='something-i-don't-want&gt;
content
&lt;/div&gt;
&lt;script class="something-else-i-dont-want'&gt;
script
&lt;/script&gt;
&lt;p&gt;
content
&lt;/p&gt;
&lt;/div&gt;
</code></pre>
<p>All of the content that I want to extract is found within the <code>&lt;div class="the-one-i-want"&gt;</code> element. Right now, I'm using the following methods, which work most of the time:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
</code></pre>
<p>This excludes scripts, weird insert <code>div</code>'s and otherwise un-predictable content such as ads or 'recommended content' type stuff.</p>
<p>Now, there are some instances in which there are elements other than just the <code>&lt;p&gt;</code> tags, which has content that is contextually important to the main content, such as lists.</p>
<p>Is there a way to get the content from the <code>&lt;div class="the-one-i-want"&gt;</code> in a manner as such:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
</code></pre>
<p>Where <code>desired-content-elements</code>would be inclusive of every element that I deemed fit for that particular content? Such as, all <code>&lt;p&gt;</code> tags, all <code>&lt;ol&gt;</code> and <code>&lt;li&gt;</code> tags, but no <code>&lt;div&gt;</code> or <code>&lt;script&gt;</code> tags.</p>
<p>Perhaps noteworthy, is my method of saving the content:</p>
<pre><code>content_string = ''
for p in content:
content_string += str(p)
</code></pre>
<p>This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.</p>
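The question's title frames the problem as excluding elements rather than listing the ones to keep. As an alternative sketch, not part of the answer above, find_all also accepts a function that receives each tag and returns True for a match. The sample markup, the EXCLUDED set, and the choice of html.parser are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative markup; only the outer class name comes from the question.
html_doc = """
<div class="the-one-i-want">
<p>intro</p>
<ul><li>bullet</li></ul>
<blockquote>quoted</blockquote>
<div class="ad">recommended content</div>
<script>var tracker = 1;</script>
<p>outro</p>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# Hypothetical deny-list: every direct child tag is kept except these.
EXCLUDED = {"div", "script"}

# find_all calls the function once per candidate tag; results stay in
# document order. recursive=False limits candidates to direct children.
kept = container.find_all(lambda tag: tag.name not in EXCLUDED, recursive=False)

kept_names = [tag.name for tag in kept]
content_string = "".join(str(tag) for tag in kept)
print(kept_names)
```

A deny-list keeps element types that were not anticipated (here <ul> and <blockquote>), while the list form in the answer keeps only what is named explicitly; which is safer depends on how unpredictable the page content is.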

Regarding "python - With BeautifulSoup, extract tags of elements except for those specified", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/38507514/
