python - Web scraping: scrape multiple pages with Python

Reposted. Author: 行者123. Updated: 2023-11-28 18:04:38

from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    # Build the URL for this page and fetch it before parsing
    page = requests.get(url + '?page=' + str(pg))
    soup = BeautifulSoup(page.content, 'lxml')
    # Print the text of every paragraph on the page
    for paragraph in soup.find_all('p'):
        print(paragraph.text)

I want to scrape the ratings, reviews, and review dates from https://uk.trustpilot.com/review/thread.com, but I don't know how to scrape multiple pages and build a pandas DataFrame from the results.

Best answer

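Hi, you need to send a request to each page and then process the response. Also, because some items are not directly available as text inside a tag, you have to get them either from the JavaScript (I load the date with json this way) or from a class name (I get the rating this way).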

import json
import pandas as pd
import requests
from bs4 import BeautifulSoup
final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    # Request each page in turn, then parse the response
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        title = paragraph.find('h2', class_='review-content__title').text.strip()
        content = paragraph.find('p', class_='review-content__text').text.strip()
        # The dates are embedded as JSON, so decode them and keep only the day
        datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # The rating is only available as the trailing number of a class name
        rating_class = paragraph.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)

Output

                                                Title                                            Content        Date Rating
0 I ordered a jacket 2 weeks ago I ordered a jacket 2 weeks ago. Still hasn't ... 2019-01-13 1
1 I've used this service for many years… I've used this service for many years and get ... 2018-12-31 4
2 Great website Great website, tailored recommendations, and e... 2018-12-19 5
3 I was excited by the prospect offered… I was excited by the prospect offered by threa... 2018-12-18 1
4 Thread set the benchmark for customer service Firstly, their customer service is second to n... 2018-12-12 5
5 It's a good idea It's a good idea. I am in between sizes and d... 2018-12-02 3
6 Great experience so far Great experience so far. Big choice of clothes... 2018-10-31 5
7 Absolutely love using Thread.com Absolutely love using Thread.com. As a man wh... 2018-10-31 5
8 I'd like to give Thread a one star… I'd like to give Thread a one star review, but... 2018-10-30 2
9 Really enjoying the shopping experience… Really enjoying the shopping experience on thi... 2018-10-22 5
10 The only way I buy clothes I absolutely love Thread. I've been surviving ... 2018-10-15 5
11 Excellent Service Excellent ServiceQuick delivery, nice items th... 2018-07-27 5
12 Convenient way to order clothes online Convenient way to order clothes online, and gr... 2018-07-05 5
13 Superb - would thoroughly recommend Recommendations have been brilliant - no more ... 2018-06-24 5
14 First time ordering from Thread First time ordering from Thread - Very slow de... 2018-06-22 1
15 Some of these criticisms are just madness I absolutely love thread.com, and I can't reco... 2018-05-28 5
16 Top service! Great idea and fantastic service. I just recei... 2018-05-17 5
17 Great service Great service. Great clothes which come well p... 2018-05-05 5
18 Thumbs up Easy, straightforward and very good costumer s... 2018-04-17 5
19 Good idea, ruined by slow delivery I really love the concept and the ordering pro... 2018-04-08 3
20 I love Thread I have been using thread for over a year. It i... 2018-03-12 5
21 Clever simple idea but.. low quality clothing Clever simple idea but.. low quality clothingL... 2018-03-12 2
22 Initially I was impressed.... Initially I was impressed with the Thread shop... 2018-02-07 2
23 Happy new customer Joined the site a few weeks ago, took a short ... 2018-02-06 5
24 Style tips for mature men I'm a man of mature age, let's say a "baby boo... 2018-01-31 5
25 Every shop, every item and in one place Simple, intuitive and makes online shopping a ... 2018-01-28 5
26 Fantastic experience all round Fantastic experience all round. Quick to regi... 2018-01-28 5
27 Superb "all in one" shopping experience … Superb "all in one" shopping experience that i... 2018-01-25 5
28 Great for time poor people who aren’t fond of ... Rally love this company. Super useful for thos... 2018-01-22 5
29 Really is worth trying! Quite cautious at first, however, love the way... 2018-01-10 4
30 14 days for returns is very poor given … 14 days for returns is very poor given most co... 2017-12-20 3
31 A great intro to online clothes … A great intro to online clothes shopping. Usef... 2017-12-15 5
32 I was skeptical at first I was skeptical at first, but the service is s... 2017-11-16 5
33 seems good to me as i hate to shop in … seems good to me as i hate to shop in stores, ... 2017-10-23 5
34 Great concept and service Great concept and service. This service has be... 2017-10-17 5
35 Slow dispatch My Order Dispatch was extremely slow compared ... 2017-10-07 1
36 This company sends me clothes in boxes This company sends me clothes in boxes! I find... 2017-08-28 5
37 I've been using Thread for the past six … I've been using Thread for the past six months... 2017-08-03 5
38 Thread Thread, this site right here is literally the ... 2017-06-22 5
39 good concept The website is a good concept in helping buyer... 2017-06-14 3
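
The two extraction tricks described in the answer (decoding the date from embedded JSON and reading the rating from a class name) can be tried offline. The following is a minimal sketch, not the answer's own code: `SAMPLE_HTML` is a hypothetical snippet that only imitates the structure the answer relies on, `parse_reviews` is an illustrative helper name, and Python's built-in `html.parser` is used instead of `lxml` so that no extra parser is needed. The live page may look different or may have changed since the answer was written.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the answer's selectors expect.
SAMPLE_HTML = """
<section class="review__content">
  <div class="star-rating star-rating-4 star-rating--medium"></div>
  <div class="review-content-header__dates">
    {"publishedDate": "2018-12-31T10:15:00Z", "updatedDate": null}
  </div>
  <h2 class="review-content__title"> Great website </h2>
  <p class="review-content__text"> Tailored recommendations. </p>
</section>
"""

def parse_reviews(html):
    """Return one [title, content, date, rating] row per review section."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for section in soup.find_all('section', class_='review__content'):
        title = section.find('h2', class_='review-content__title').text.strip()
        content = section.find('p', class_='review-content__text').text.strip()
        # The date is stored as a JSON string, so decode it and keep the day part
        dates = json.loads(section.find('div', class_='review-content-header__dates').text)
        date = dates['publishedDate'].split('T')[0]
        # The rating is the trailing number of the second class, e.g. "star-rating-4"
        rating_class = section.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        rows.append([title, content, date, rating])
    return rows

rows = parse_reviews(SAMPLE_HTML)
print(rows)
```

Note that the rating comes back as a string, as in the answer's DataFrame; convert it with `int(...)` if numeric ratings are needed.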

Note: although I was able to "hack" my way to results for this site, it is better to use selenium for scraping dynamic pages.

Edit: code that works out the number of pages automatically

import json
import math
import pandas as pd
import requests
from bs4 import BeautifulSoup
final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
# Make one request up front to read the total number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_="header--inline").text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# There are 20 reviews per page, so the number of pages is
pages = int(math.ceil(review_count / 20))
# Loop over pages 1 to pages (range excludes its upper bound, hence pages + 1)
for pg in range(1, pages + 1):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for paragraph in soup.find_all('section', class_='review__content'):
        try:
            title = paragraph.find('h2', class_='review-content__title').text.strip()
            content = paragraph.find('p', class_='review-content__text').text.strip()
            # The dates are embedded as JSON, so decode them and keep only the day
            datedata = json.loads(paragraph.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            # The rating is only available as the trailing number of a class name
            rating_class = paragraph.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # Skip reviews where an expected title, text, or date element is missing
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
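
The page-count arithmetic above can be checked on its own without touching the network. This is a small sketch under stated assumptions: the helper names `parse_review_count` and `page_count` are illustrative, the header strings are made-up examples that start with the count (as the answer's `split(' ')[0]` implies), and stripping a thousands separator is an extra precaution that the answer's code does not include.

```python
import math

REVIEWS_PER_PAGE = 20  # page size stated in the answer; an assumption about the site

def parse_review_count(header_text):
    """Return the leading integer of a header such as '1,234 reviews'."""
    first_token = header_text.strip().split(' ')[0]
    # Assumption: large counts may contain a thousands separator
    return int(first_token.replace(',', ''))

def page_count(review_count, per_page=REVIEWS_PER_PAGE):
    """Round up so that a partly filled last page is still requested."""
    return math.ceil(review_count / per_page)

print(page_count(parse_review_count('40 reviews')))     # 2
print(page_count(parse_review_count('1,234 reviews')))  # 62
print(page_count(41))                                   # 3
```

Rounding up matters because 41 reviews need three pages, not two; integer division alone would silently drop the last page.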

Regarding "python - Web scraping: scrape multiple pages with Python", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54174187/
