
Python/bs4 : Span inside div tag - text extraction

Reposted. Author: 太空宇宙. Updated: 2023-11-03 18:15:41

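I'm extracting text from a div tag. The catch is that the div contains a closing tag that was never opened. So if I do raw = soup.find('div', class_='inside').text, I only get the text that comes before that stray tag.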

An example:

<div class='inside'><div>sth0</div><div>sth1</div></span><div>sth2<div></div>

soup.find('div', class_='inside').text

>>> sth0 sth1

Do you know how to get the whole text from the div tag? Thanks.

EDIT (according to Tanmaya Meher, the code above should work, but it doesn't for me, so I'm attaching the exact problem):

When I run this code:

raw = firmHtml.find('div', class_='inside').text
print raw

I get:

Katalóg   Obchody a veľkoobchod

instead of:

Katalóg   Obchody a veľkoobchod   Stavebniny   Izolačný materiál...

Here is part of the HTML I'm parsing:

<div class="inside"><div class="inside2"><a href="/katalog/" style="font-size:12px" title="Katalóg"><span>Katalóg</span></a> <span class="sipka s1">&nbsp;</span> <a href="/katalog/obchody-a-velkoobchod/" style="font-size:12px" itemprop="url" title="Obchody a veľkoobchod"><span itemprop="title" >Obchody a veľkoobchod</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/stavebniny_1/" style="font-size:12px" itemprop="url" title="Stavebniny"><span itemprop="title" >Stavebniny</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/izolacny-material/" style="font-size:12px" itemprop="url" title="Izolačný materiál"><span itemprop="title" >Izolačný materiál</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/protipoziarne-izolacie/" style="font-size:12px" itemprop="url" title="Protipožiarne izolácie"><span itemprop="title" >Protipožiarne izolácie</span></a></span> <span class="sipka s1">&nbsp;</span> Ing. Milan Kalafut</div></div></div><div id="main"><div id="content"><div itemscope itemtype="http://schema.org/LocalBusiness" class="business-container"><div id="lavy"><div class="foto s3"><img src="http://s.aimg.sk/katalog/css/images/nologo.gif" alt="Logo nieje k dispozícii" /></div><div id="moznosti">

Maybe I'm overlooking something.
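One way to check where the markup is unbalanced is to list every closing tag that has no matching open tag. The sketch below is only a rough diagnostic built on the standard library's html.parser; the names StrayCloserFinder and find_stray_closers are made up for this illustration, and the check ignores HTML's optional-tag rules, so it may flag closers that a browser would accept.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they are not tracked as "open".
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StrayCloserFinder(HTMLParser):
    """Record closing tags that have no matching open tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tag names
        self.stray = []       # (tag, (line, offset)) for each unmatched closer

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag in self.open_tags:
            # Pop up to and including the nearest matching open tag.
            while self.open_tags.pop() != tag:
                pass
        else:
            # Nothing on the stack matches: this closer is unmatched.
            self.stray.append((tag, self.getpos()))


def find_stray_closers(html):
    """Return a list of (tag, (line, offset)) for unmatched closing tags."""
    finder = StrayCloserFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


sample = ("<div class='inside'><div>sth0</div><div>sth1</div>"
          "</span><div>sth2<div></div>")
for tag, (line, offset) in find_stray_closers(sample):
    print("unmatched </%s> at line %d, offset %d" % (tag, line, offset))
```

On the short example above this reports the lone </span>; running it over the full page source would show whether other closers are unbalanced as well.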

Best answer

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as BS

html_text = '<div class="inside"><div class="inside2"><a href="/katalog/" style="font-size:12px" title="Katalóg"><span>Katalóg</span></a> <span class="sipka s1">&nbsp;</span> <a href="/katalog/obchody-a-velkoobchod/" style="font-size:12px" itemprop="url" title="Obchody a veľkoobchod"><span itemprop="title" >Obchody a veľkoobchod</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/stavebniny_1/" style="font-size:12px" itemprop="url" title="Stavebniny"><span itemprop="title" >Stavebniny</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/izolacny-material/" style="font-size:12px" itemprop="url" title="Izolačný materiál"><span itemprop="title" >Izolačný materiál</span></a></span> <span class="sipka s1">&nbsp;</span> <span itemprop="child" itemscope itemtype="http://data-vocabulary.org/Breadcrumb" ><a href="/katalog/protipoziarne-izolacie/" style="font-size:12px" itemprop="url" title="Protipožiarne izolácie"><span itemprop="title" >Protipožiarne izolácie</span></a></span> <span class="sipka s1">&nbsp;</span> Ing. Milan Kalafut</div></div></div><div id="main"><div id="content"><div itemscope itemtype="http://schema.org/LocalBusiness" class="business-container"><div id="lavy"><div class="foto s3"><img src="http://s.aimg.sk/katalog/css/images/nologo.gif" alt="Logo nieje k dispozícii" /></div><div id="moznosti">'

# Alternatively, read the same HTML from a file:
# html_text = open("a.html", "r").read()

# No parser is named here, so Beautiful Soup picks the best one installed.
firmHtml = BS(html_text)
raw = firmHtml.find('div', class_='inside').text

print(raw)

Output (Python 2.7.5 and Python 3.3.2 on Linux):

Katalóg   Obchody a veľkoobchod   Stavebniny   Izolačný materiál   Protipožiarne izolácie   Ing. Milan Kalafut
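The same call can return different text on different machines because Beautiful Soup builds its tree with whichever parser is installed, and parsers do not all treat an unmatched closing tag the same way. That may be why this code prints the full breadcrumb here but not for the asker, although the question does not confirm it. The sketch below, which assumes bs4 is installed, names the parser explicitly and compares the results; inside_text and compare_parsers are hypothetical helper names, and lxml and html5lib are skipped if they are missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

SAMPLE = ("<div class='inside'><div>sth0</div><div>sth1</div>"
          "</span><div>sth2<div></div>")


def inside_text(html, parser):
    """Return the text of the first div.inside as seen by the given parser."""
    soup = BeautifulSoup(html, parser)
    node = soup.find("div", class_="inside")
    # get_text(" ", strip=True) joins the text pieces with single spaces.
    return node.get_text(" ", strip=True) if node is not None else None


def compare_parsers(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each installed parser name to the text it extracts."""
    results = {}
    for name in parsers:
        try:
            results[name] = inside_text(html, name)
        except FeatureNotFound:
            # That parser library is not installed; skip it.
            continue
    return results


for name, text in compare_parsers(SAMPLE).items():
    print("%-12s %r" % (name, text))
```

If one parser returns the whole text and another stops at the stray </span>, passing the name of the parser that works as the second argument to BeautifulSoup makes the result independent of what happens to be installed.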

This post on Python/bs4 : Span inside div tag - text extraction is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/25054840/
