
Python: scraping text only from multiple web pages

Reposted. Author: 太空宇宙. Updated: 2023-11-03 23:52:36

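I am new to Python and am currently learning web scraping. The task is to scrape the first 5 pages of Inspiron questions on the Dell Community. I have code that runs and returns the information I need, but I cannot get the text alone: my current code returns text plus HTML. I have tried adding .text in different places in the code, but doing so only produces errors.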

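The most common error is: "AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?"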

Here is my code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import os, csv
from time import sleep

import requests

pages = ['https://www.dell.com/community/Inspiron/bd-p/Inspiron',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/2',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/3',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/4',
         'https://www.dell.com/community/Inspiron/bd-p/Inspiron/page/5'
         ]

data = []

for page in pages:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    rows = soup.select('tbody tr')

    for row in rows:
        d = dict()
        d['title'] = soup.find_all('a', attrs={'class': 'page-link lia-link-navigation lia-custom-event'})
        d['author'] = soup.find_all('span', attrs={'class': 'login-bold'})
        d['time'] = soup.find_all('span', attrs={'class': 'local-time'})
        d['kudos'] = soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-kudos-count'})
        d['messages'] = soup.find_all('div', attrs={'class': 'lia-component-messages-column-message-replies-count'})
        d['views'] = soup.find_all('div', attrs={'class': 'lia-component-messages-column-topic-views-count'})
        d['solved'] = soup.find_all('td', attrs={'aria-label': 'triangletop lia-data-cell-secondary lia-data-cell-icon'})
        d['latest'] = soup.find_all('span', attrs={'cssclass': 'lia-info-area-item'})
        data.append(d)

    sleep(10)
print(data[0])

Any help is greatly appreciated. Thanks!

Best answer

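find_all returns a list of HTML elements (a ResultSet), not a single element. To print the text of each element, iterate over each list that find_all returns and read the .text attribute of every individual entry. For example: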

titles = soup.find_all('a', attrs={'class': 'page-link lia-link-navigation lia-custom-event'})
for title in titles:
    print(title.text)  # .text is an attribute, not a method, so no parentheses
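The same idea can be applied per table row, so that each dictionary holds plain strings instead of ResultSet objects. The following is only a minimal sketch: the SAMPLE_HTML markup is a hypothetical, simplified stand-in for the forum's topic table (the real page structure may differ), the helper names first_text and extract_rows are made up for illustration, and matching on a single class name such as page-link is an assumption. It calls find() on each row rather than find_all() on the whole soup, and returns None when a cell is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for one page of the topic table.
SAMPLE_HTML = """
<table><tbody>
<tr>
  <td><a class="page-link lia-link-navigation lia-custom-event" href="/t/1"> Battery not charging </a></td>
  <td><span class="login-bold">alice</span></td>
  <td><div class="lia-component-messages-column-message-kudos-count"> 3 </div></td>
</tr>
<tr>
  <td><a class="page-link lia-link-navigation lia-custom-event" href="/t/2">Fan noise</a></td>
  <td><span class="login-bold">bob</span></td>
</tr>
</tbody></table>
"""


def first_text(parent, name, css_class):
    """Return the stripped text of the first matching tag under parent, or None."""
    tag = parent.find(name, class_=css_class)  # find() returns one Tag or None
    return tag.get_text(strip=True) if tag is not None else None


def extract_rows(html):
    """Build one dict of plain strings per table row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.select("tbody tr"):
        # Search inside the current row, not the whole document.
        records.append({
            "title": first_text(row, "a", "page-link"),
            "author": first_text(row, "span", "login-bold"),
            "kudos": first_text(row, "div", "lia-component-messages-column-message-kudos-count"),
        })
    return records


records = extract_rows(SAMPLE_HTML)
for record in records:
    print(record)
```

In the original loop, r.content from each requests.get(page) call could be passed to extract_rows in place of SAMPLE_HTML, assuming the live markup uses the same class names.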

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/58850012/
