from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    pg = url + '?page=' + str(pg)
    soup = BeautifulSoup(page.content, 'lxml')
    for paragraph in soup.find_all('p'):
        print(paragraph.text)
I want to scrape the rating, review text, and review date from https://uk.trustpilot.com/review/thread.com, but I don't know how to scrape across multiple pages and build a single pandas DataFrame from the results.
Best Answer
Hi, you need to send a request to each page and then process the response. Also, since some items are not available directly as text inside a tag, you either pull them out of the embedded JavaScript/JSON (I load the date this way with json) or out of a class name (I get the rating this way).
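In isolation, the two extraction tricks look like this. The class list and the JSON payload below are hypothetical stand-ins for what the Trustpilot markup served at the time; only the string handling is the point:

```python
import json

# Hypothetical class attribute, in the list form BeautifulSoup's tag['class'] returns
rating_class = ['star-rating', 'star-rating--4']
# The numeric rating only appears at the end of the modifier class name
rating = rating_class[1].split('-')[-1]

# The dates div carried a JSON blob instead of plain text (hypothetical payload)
dates_text = '{"publishedDate": "2019-01-13T10:21:41Z", "updatedDate": null}'
# Parse the blob and keep only the date part of the ISO timestamp
date = json.loads(dates_text)['publishedDate'].split('T')[0]

print(rating, date)  # 4 2019-01-13
```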
from bs4 import BeautifulSoup
import pandas as pd
import json
import requests

final_list = []  # final list to become the df
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        title = paragraph.find('h2', class_='review-content__title').text.strip()
        content = paragraph.find('p', class_='review-content__text').text.strip()
        # the dates are embedded as JSON inside this div, not as plain text
        datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the rating is only available via the star-rating class name
        rating_class = paragraph.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
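Once the rows are collected, the same list-of-lists-to-DataFrame step can be extended to tidy the types and persist the results. The rows below are hypothetical stand-ins for scraped data:

```python
import pandas as pd

# Hypothetical rows in the same [title, content, date, rating] shape as final_list
rows = [
    ['Great website', 'Great website, tailored recommendations', '2018-12-19', '5'],
    ['Slow dispatch', 'Dispatch was extremely slow', '2017-10-07', '1'],
]
df = pd.DataFrame(rows, columns=['Title', 'Content', 'Date', 'Rating'])
# Parse the date strings and make the rating numeric so they sort/aggregate correctly
df['Date'] = pd.to_datetime(df['Date'])
df['Rating'] = df['Rating'].astype(int)
# Persist the scraped results for later analysis
df.to_csv('reviews.csv', index=False)
```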
Output
Title Content Date Rating
0 I ordered a jacket 2 weeks ago I ordered a jacket 2 weeks ago. Still hasn't ... 2019-01-13 1
1 I've used this service for many years… I've used this service for many years and get ... 2018-12-31 4
2 Great website Great website, tailored recommendations, and e... 2018-12-19 5
3 I was excited by the prospect offered… I was excited by the prospect offered by threa... 2018-12-18 1
4 Thread set the benchmark for customer service Firstly, their customer service is second to n... 2018-12-12 5
5 It's a good idea It's a good idea. I am in between sizes and d... 2018-12-02 3
6 Great experience so far Great experience so far. Big choice of clothes... 2018-10-31 5
7 Absolutely love using Thread.com Absolutely love using Thread.com. As a man wh... 2018-10-31 5
8 I'd like to give Thread a one star… I'd like to give Thread a one star review, but... 2018-10-30 2
9 Really enjoying the shopping experience… Really enjoying the shopping experience on thi... 2018-10-22 5
10 The only way I buy clothes I absolutely love Thread. I've been surviving ... 2018-10-15 5
11 Excellent Service Excellent ServiceQuick delivery, nice items th... 2018-07-27 5
12 Convenient way to order clothes online Convenient way to order clothes online, and gr... 2018-07-05 5
13 Superb - would thoroughly recommend Recommendations have been brilliant - no more ... 2018-06-24 5
14 First time ordering from Thread First time ordering from Thread - Very slow de... 2018-06-22 1
15 Some of these criticisms are just madness I absolutely love thread.com, and I can't reco... 2018-05-28 5
16 Top service! Great idea and fantastic service. I just recei... 2018-05-17 5
17 Great service Great service. Great clothes which come well p... 2018-05-05 5
18 Thumbs up Easy, straightforward and very good costumer s... 2018-04-17 5
19 Good idea, ruined by slow delivery I really love the concept and the ordering pro... 2018-04-08 3
20 I love Thread I have been using thread for over a year. It i... 2018-03-12 5
21 Clever simple idea but.. low quality clothing Clever simple idea but.. low quality clothingL... 2018-03-12 2
22 Initially I was impressed.... Initially I was impressed with the Thread shop... 2018-02-07 2
23 Happy new customer Joined the site a few weeks ago, took a short ... 2018-02-06 5
24 Style tips for mature men I'm a man of mature age, let's say a "baby boo... 2018-01-31 5
25 Every shop, every item and in one place Simple, intuitive and makes online shopping a ... 2018-01-28 5
26 Fantastic experience all round Fantastic experience all round. Quick to regi... 2018-01-28 5
27 Superb "all in one" shopping experience … Superb "all in one" shopping experience that i... 2018-01-25 5
28 Great for time poor people who aren’t fond of ... Rally love this company. Super useful for thos... 2018-01-22 5
29 Really is worth trying! Quite cautious at first, however, love the way... 2018-01-10 4
30 14 days for returns is very poor given … 14 days for returns is very poor given most co... 2017-12-20 3
31 A great intro to online clothes … A great intro to online clothes shopping. Usef... 2017-12-15 5
32 I was skeptical at first I was skeptical at first, but the service is s... 2017-11-16 5
33 seems good to me as i hate to shop in … seems good to me as i hate to shop in stores, ... 2017-10-23 5
34 Great concept and service Great concept and service. This service has be... 2017-10-17 5
35 Slow dispatch My Order Dispatch was extremely slow compared ... 2017-10-07 1
36 This company sends me clothes in boxes This company sends me clothes in boxes! I find... 2017-08-28 5
37 I've been using Thread for the past six … I've been using Thread for the past six months... 2017-08-03 5
38 Thread Thread, this site right here is literally the ... 2017-06-22 5
39 good concept The website is a good concept in helping buyer... 2017-06-14 3
Note: although I was able to "hack" my way into getting results from this site, it is better to use selenium when scraping dynamic pages.
Edit: code that automatically finds the number of pages
from bs4 import BeautifulSoup
import math
import pandas as pd
import json
import requests

final_list = []  # final list to become the df
url = 'https://uk.trustpilot.com/review/thread.com'
# making a request to get the number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_='header--inline').text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the page count can be calculated as
pages = int(math.ceil(review_count / 20))
# the range now runs from 1 to pages+1
for pg in range(1, pages + 1):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # some reviews are missing one of the fields; skip them
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
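The page-count arithmetic from the script above can be checked on its own. The header text here is a hypothetical example of what the `header--inline` element might contain:

```python
import math

# Hypothetical header text, e.g. "787 reviews" with surrounding whitespace
header_text = '  787 reviews  '
# The review count is the first whitespace-separated token
review_count = int(header_text.strip().split(' ')[0])
# Trustpilot shows 20 reviews per page, so round the quotient up
pages = int(math.ceil(review_count / 20))
print(pages)  # 40
```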
For "python - Web scraping: scrape multiple webs by Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54174187/