
html - BeautifulSoup: extracting strings from HTML tags, ResultSet objects

Reposted. Author: 搜寻专家. Updated: 2023-10-31 22:59:42

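I am confused about how to work with a ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.

After using find_all(), how do I extract the text?

Example:

In the bs4 documentation, the HTML document html_doc looks like: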

<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>

First, create the soup and find all the links (the a tags):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

which outputs

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We can also do

for link in soup.find_all('a'):
    print(link.get('href'))

which outputs

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I want the text of the tags with class_="sister", i.e.

Elsie
Lacie
Tillie

One could try

for link in soup.find_all('a'):
    print(link.get_text())

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'
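
For reference, this AttributeError is what Beautiful Soup raises when a Tag method such as get_text() is called on the ResultSet itself rather than on the tags it contains. A minimal sketch, assuming a compact copy of the document above (the names `links`, `raised`, and `texts` are illustrative, not from the original post):

```python
from bs4 import BeautifulSoup

# Compact stand-in for html_doc (assumption: same three links, no inner newlines).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
links = soup.find_all('a')  # a ResultSet: list-like, not a single Tag

# Calling a Tag method on the whole ResultSet raises AttributeError.
try:
    links.get_text()
    raised = False
except AttributeError:
    raised = True

# Calling the same method on each Tag inside the ResultSet works.
texts = [link.get_text() for link in links]
print(raised, texts)
```

Because a ResultSet behaves like a list, anything that works on a list (iteration, indexing, len()) works on it, while Tag methods have to be applied per element.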

Best answer

Filter with find_all() on class_='sister'.

Note: mind the underscore after class. This is a special case because class is a reserved word in Python.

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_:

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
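
As a small illustration of the quoted passage, the sketch below compares the class_ keyword with the equivalent attrs dictionary, which avoids the reserved word altogether. It assumes a compact copy of the document from the question; the names `by_keyword`, `by_attrs`, and `ids` are illustrative.

```python
from bs4 import BeautifulSoup

# Compact stand-in for html_doc (assumption: same three links as in the question).
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword form: note the trailing underscore in class_.
by_keyword = soup.find_all('a', class_='sister')

# Dictionary form: the attribute name is an ordinary string key.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

ids = [tag['id'] for tag in by_keyword]
print(ids)
print(by_keyword == by_attrs)
```

Both calls should select the same three tags; which form to use is mostly a matter of taste.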

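Once you have all the sister tags, read .text on each of them to get the text. Be sure to strip the surrounding whitespace from the text.

For example: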

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    # .text includes the newlines inside each tag, so strip them
    print(tag.text.strip())

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
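
The stripping step can also be handed to get_text() through its strip argument, which collects the names into a list in one pass. This is a hedged variant of the answer's loop, not part of the original answer; the name `names` is illustrative.

```python
from bs4 import BeautifulSoup

# Same layout as the answer's html_doc: each name sits on its own line inside the tag.
html_doc = '''<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well.
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# get_text(strip=True) strips whitespace from each text piece before joining.
names = [tag.get_text(strip=True) for tag in soup.find_all(class_='sister')]
print(names)  # ['Elsie', 'Lacie', 'Tillie']
```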

Regarding "html - BeautifulSoup: extracting strings from HTML tags, ResultSet objects", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33510881/
