
python - Unable to scrape all reviews

Reposted · Author: 行者123 · Updated: 2023-12-01 07:47:06

I am trying to scrape this website for its reviews, but I ran into a problem:

  • The page only loads 50 reviews.
  • To load more, you have to click "Show more reviews", but I don't know how to get all the data, because there are no page links, and "Show more reviews" exposes no URL to follow; the address stays the same.

url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"

import requests
from bs4 import BeautifulSoup
import pandas as pd

a = []

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Each review body sits in a div with class "review-comments"
table = soup.findAll("div", {"class": "review-comments"})
for x in table:
    a.append(x.text)

df = pd.DataFrame(a)
df.to_csv("review.csv", sep='\t')

I know this isn't pretty code, but I just want to get the review text first. Please help, as I'm fairly new to this.

Best Answer

Looking at the site, the "Show more reviews" button makes an AJAX call that returns the additional reviews. All you have to do is find the endpoint it calls and send a GET request to it (I pulled the product ID out with a simple regex):

import requests
import re
from bs4 import BeautifulSoup

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) snap Chromium/74.0.3729.169 Chrome/74.0.3729.169 Safari/537.36"
}
url = "https://www.capterra.com/p/134048/HiMama-Preschool-Child-Care-App/#reviews"
Data = []
# Each page is equivalent to 50 comments:
MaximumCommentPages = 3

with requests.Session() as session:
    info = session.get(url, headers=headers)
    # Get the product ID, needed for requesting more comments
    productID = re.search(r'"product_id":(\w*)', info.text).group(1)
    # Extract reviews from the main page
    soup = BeautifulSoup(info.content, "html.parser")
    table = soup.findAll("div", {"class": "review-comments"})
    for x in table:
        Data.append(x)
    # Get the additional pages via the AJAX endpoint:
    params = {
        "page": "",
        "product_id": productID
    }
    while MaximumCommentPages > 1:  # stop at 1 because page 1 is the main page data we already extracted!
        MaximumCommentPages -= 1
        params["page"] = str(MaximumCommentPages)
        additionalInfo = session.get("https://www.capterra.com/gdm_reviews", params=params, headers=headers)
        print(additionalInfo.url)
        # Extract reviews from the additional pages:
        soup = BeautifulSoup(additionalInfo.content, "html.parser")
        table = soup.findAll("div", {"class": "review-comments"})
        for x in table:
            Data.append(x)

# Write the data out the old-fashioned way:
counter = 1
with open('review.csv', 'w') as f:
    for one in Data:
        f.write(str(counter))
        f.write(one.text)
        f.write('\n')
        counter += 1

Note how I use a session to preserve the cookies for the AJAX call.
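The point about sessions generalizes: a `requests.Session` keeps one cookie jar (and any default headers you set) across every request made through it, so cookies set by the first page load are automatically sent with the later AJAX calls. A minimal sketch of that behavior; the user-agent string and cookie here are made-up illustrations, not values from the site:

```python
import requests

session = requests.Session()
# Default headers set once on the session apply to every request it makes
session.headers.update({"user-agent": "my-scraper/1.0"})

# Cookies received in any response are stored in the session's cookie jar;
# here one is planted manually to show it persists for subsequent requests
session.cookies.set("session_token", "abc123")

# Both values will accompany any session.get(...) call from here on
print(session.headers["user-agent"])
print(session.cookies.get("session_token"))
```

With plain `requests.get()` calls, each request starts with an empty cookie jar, which is why sites that set cookies on the first page load can reject later AJAX requests.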

Edit 1: you can reload the web page and call the AJAX endpoint again to get more data.

Edit 2: save the data using your own preferred method.
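For example, since the question already imports pandas, the collected comment texts could be saved with a DataFrame instead of manual file writes. A sketch; the `comments` list is a hypothetical stand-in for `[x.text for x in Data]` from the script above:

```python
import pandas as pd

# Hypothetical stand-in for the scraped review texts
comments = ["Great app!", "Support was slow."]

df = pd.DataFrame({"review": comments})
df.index += 1  # number reviews starting at 1, like the manual counter
df.to_csv("review.csv", sep="\t")
```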

Edit 3: changed a few things; it now grabs however many pages you want and saves to a file with good ol' open().

Regarding python - Unable to scrape all reviews, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56409474/
