from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    pg = url + '?page=' + str(pg)
    soup = BeautifulSoup(page.content, 'lxml')
    for paragraph in soup.find_all('p'):
        print(paragraph.text)
I want to scrape the rating, review text, and review date from https://uk.trustpilot.com/review/thread.com, but I don't know how to scrape across multiple pages and build a single pandas DataFrame from the results.
Best Answer
Hi, you need to send a request to each page and then process the response. Also, since some items are not available directly as text inside a tag, you either pull them out of the embedded JavaScript/JSON (I load the date this way with json) or out of a class name (I get the rating this way).
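In isolation, the two extraction tricks look like this. The class list and the JSON payload below are hypothetical stand-ins for what the Trustpilot markup served at the time; only the string handling is the point:

```python
import json

# Hypothetical class attribute, in the list form BeautifulSoup's tag['class'] returns
rating_class = ['star-rating', 'star-rating--4']
# The numeric rating only appears at the end of the modifier class name
rating = rating_class[1].split('-')[-1]

# The dates div carried a JSON blob instead of plain text (hypothetical payload)
dates_text = '{"publishedDate": "2019-01-13T10:21:41Z", "updatedDate": null}'
# Parse the blob and keep only the date part of the ISO timestamp
date = json.loads(dates_text)['publishedDate'].split('T')[0]

print(rating, date)  # 4 2019-01-13
```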
from bs4 import BeautifulSoup
import pandas as pd
import json
import requests

final_list = []  # final list to become the df
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        title = paragraph.find('h2', class_='review-content__title').text.strip()
        content = paragraph.find('p', class_='review-content__text').text.strip()
        # the dates are embedded as JSON inside this div, not as plain text
        datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the rating is only available via the star-rating class name
        rating_class = paragraph.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
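Once the rows are collected, the same list-of-lists-to-DataFrame step can be extended to tidy the types and persist the results. The rows below are hypothetical stand-ins for scraped data:

```python
import pandas as pd

# Hypothetical rows in the same [title, content, date, rating] shape as final_list
rows = [
    ['Great website', 'Great website, tailored recommendations', '2018-12-19', '5'],
    ['Slow dispatch', 'Dispatch was extremely slow', '2017-10-07', '1'],
]
df = pd.DataFrame(rows, columns=['Title', 'Content', 'Date', 'Rating'])
# Parse the date strings and make the rating numeric so they sort/aggregate correctly
df['Date'] = pd.to_datetime(df['Date'])
df['Rating'] = df['Rating'].astype(int)
# Persist the scraped results for later analysis
df.to_csv('reviews.csv', index=False)
```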
Output
Title Content Date Rating
0 I ordered a jacket 2 weeks ago I ordered a jacket 2 weeks ago. Still hasn't ... 2019-01-13 1
1 I've used this service for many years… I've used this service for many years and get ... 2018-12-31 4
2 Great website Great website, tailored recommendations, and e... 2018-12-19 5
3 I was excited by the prospect offered… I was excited by the prospect offered by threa... 2018-12-18 1
4 Thread set the benchmark for customer service Firstly, their customer service is second to n... 2018-12-12 5
5 It's a good idea It's a good idea. I am in between sizes and d... 2018-12-02 3
6 Great experience so far Great experience so far. Big choice of clothes... 2018-10-31 5
7 Absolutely love using Thread.com Absolutely love using Thread.com. As a man wh... 2018-10-31 5
8 I'd like to give Thread a one star… I'd like to give Thread a one star review, but... 2018-10-30 2
9 Really enjoying the shopping experience… Really enjoying the shopping experience on thi... 2018-10-22 5
10 The only way I buy clothes I absolutely love Thread. I've been surviving ... 2018-10-15 5
11 Excellent Service Excellent ServiceQuick delivery, nice items th... 2018-07-27 5
12 Convenient way to order clothes online Convenient way to order clothes online, and gr... 2018-07-05 5
13 Superb - would thoroughly recommend Recommendations have been brilliant - no more ... 2018-06-24 5
14 First time ordering from Thread First time ordering from Thread - Very slow de... 2018-06-22 1
15 Some of these criticisms are just madness I absolutely love thread.com, and I can't reco... 2018-05-28 5
16 Top service! Great idea and fantastic service. I just recei... 2018-05-17 5
17 Great service Great service. Great clothes which come well p... 2018-05-05 5
18 Thumbs up Easy, straightforward and very good costumer s... 2018-04-17 5
19 Good idea, ruined by slow delivery I really love the concept and the ordering pro... 2018-04-08 3
20 I love Thread I have been using thread for over a year. It i... 2018-03-12 5
21 Clever simple idea but.. low quality clothing Clever simple idea but.. low quality clothingL... 2018-03-12 2
22 Initially I was impressed.... Initially I was impressed with the Thread shop... 2018-02-07 2
23 Happy new customer Joined the site a few weeks ago, took a short ... 2018-02-06 5
24 Style tips for mature men I'm a man of mature age, let's say a "baby boo... 2018-01-31 5
25 Every shop, every item and in one place Simple, intuitive and makes online shopping a ... 2018-01-28 5
26 Fantastic experience all round Fantastic experience all round. Quick to regi... 2018-01-28 5
27 Superb "all in one" shopping experience … Superb "all in one" shopping experience that i... 2018-01-25 5
28 Great for time poor people who aren’t fond of ... Rally love this company. Super useful for thos... 2018-01-22 5
29 Really is worth trying! Quite cautious at first, however, love the way... 2018-01-10 4
30 14 days for returns is very poor given … 14 days for returns is very poor given most co... 2017-12-20 3
31 A great intro to online clothes … A great intro to online clothes shopping. Usef... 2017-12-15 5
32 I was skeptical at first I was skeptical at first, but the service is s... 2017-11-16 5
33 seems good to me as i hate to shop in … seems good to me as i hate to shop in stores, ... 2017-10-23 5
34 Great concept and service Great concept and service. This service has be... 2017-10-17 5
35 Slow dispatch My Order Dispatch was extremely slow compared ... 2017-10-07 1
36 This company sends me clothes in boxes This company sends me clothes in boxes! I find... 2017-08-28 5
37 I've been using Thread for the past six … I've been using Thread for the past six months... 2017-08-03 5
38 Thread Thread, this site right here is literally the ... 2017-06-22 5
39 good concept The website is a good concept in helping buyer... 2017-06-14 3
Note: although I was able to "hack" my way into getting results from this site, it is better to use selenium when scraping dynamic pages.
Edit: code that automatically finds the number of pages
from bs4 import BeautifulSoup
import math
import pandas as pd
import json
import requests

final_list = []  # final list to become the df
url = 'https://uk.trustpilot.com/review/thread.com'
# making a request to get the number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_='header--inline').text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the page count can be calculated as
pages = int(math.ceil(review_count / 20))
# the range now runs from 1 to pages+1
for pg in range(1, pages + 1):
    pg = url + '?page=' + str(pg)
    r = requests.get(pg)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # some reviews are missing one of the fields; skip them
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
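The page-count arithmetic from the script above can be checked on its own. The header text here is a hypothetical example of what the `header--inline` element might contain:

```python
import math

# Hypothetical header text, e.g. "787 reviews" with surrounding whitespace
header_text = '  787 reviews  '
# The review count is the first whitespace-separated token
review_count = int(header_text.strip().split(' ')[0])
# Trustpilot shows 20 reviews per page, so round the quotient up
pages = int(math.ceil(review_count / 20))
print(pages)  # 40
```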
For "python - Web scraping: scrape multiple webs by Python", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54174187/