html - BeautifulSoup: extracting strings from HTML tags, ResultSet objects

Reposted by: 搜寻专家. Updated: 2023-10-31 22:59:42

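I am confused about how to work with the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.

After calling find_all(), how do I extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like: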

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

First, create the soup and find all the <a> tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

which outputs

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also extract the href of each link:

for link in soup.find_all('a'):
    print(link.get('href'))

which outputs

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I would like only the text of the tags with class_="sister", i.e.

Elsie
Lacie
Tillie

One can try

for link in soup.find_all('a'):
    print(link.get_text())

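but calling get_text() on the ResultSet itself, e.g. soup.find_all('a').get_text(), results in an error (the loop above, which calls it on each tag, runs without one):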

AttributeError: 'ResultSet' object has no attribute 'get_text'

Best answer

Filter the find_all() call with class_='sister'.

Note: mind the underscore after class. This is a special case because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
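As a minimal sketch of the class_ filter described above (the shortened html_doc and the variable names are assumptions made for illustration), the following compares three ways of matching the sister links: the class_ keyword, an attrs dictionary, and a CSS selector passed to select().

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the html_doc in the question (an assumption for this sketch).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# class_ (with a trailing underscore) avoids the reserved word "class".
by_keyword = soup.find_all('a', class_='sister')

# The same filter expressed through the attrs dictionary.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# The same filter expressed as a CSS selector.
by_selector = soup.select('a.sister')

ids_keyword = [tag['id'] for tag in by_keyword]
ids_attrs = [tag['id'] for tag in by_attrs]
ids_selector = [tag['id'] for tag in by_selector]
print(ids_keyword)
```

All three calls should select the same three tags here; which one to use is largely a matter of taste, though class_ is the form the quoted documentation describes.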

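Once you have all the sister tags, read .text on each of them to get the text. Be sure to strip the surrounding whitespace from the text.

For example: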

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

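soup = BeautifulSoup(html_doc, 'html.parser')
# Keep only the tags whose CSS class is "sister".
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    # .text includes the surrounding whitespace, so strip it.
    print(tag.text.strip())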

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
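The error in the question comes from treating the ResultSet as if it were a single tag. The following is a minimal sketch, assuming a shortened html_doc and illustrative variable names, showing that a ResultSet behaves like a list of Tag objects, so the text has to be read from each element. It uses get_text(strip=True) as an alternative to .text.strip().

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet

# Shortened stand-in for the html_doc above (an assumption for this sketch).
html_doc = '''<p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')

# find_all() returns a ResultSet, which is a list subclass holding Tag objects.
is_result_set = isinstance(sistertags, ResultSet)
is_list = isinstance(sistertags, list)

# Text lives on each Tag, so read it element by element.
names = [tag.get_text(strip=True) for tag in sistertags]
print(names)

# Calling a Tag method on the ResultSet itself raises AttributeError.
try:
    sistertags.get_text()
    raised = False
except AttributeError:
    raised = True
print(raised)
```

Collecting the names into a list, rather than printing inside the loop, keeps them available for later processing.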

A similar question about html - BeautifulSoup, extracting strings from HTML tags, ResultSet objects can be found on Stack Overflow: https://stackoverflow.com/questions/33510881/
