
python - Scraping a dynamic website with beautifulsoup


I am scraping the website nykaa.com, link: https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=1. There are 25 pages, and each page loads its data dynamically; I am unable to find where the data comes from. Also, when scraping I only get 20 products, which keep getting duplicated until the list grows to 420 products.

import requests
from bs4 import BeautifulSoup
import unicodecsv as csv


urls = []
l1 = []


for page in range(1, 5):
    result = requests.get("https://www.nykaa.com/skin/moisturizers/serums-essence/c/8397?root=nav_3&page_no=" + str(page))
    src = result.content

    soup = BeautifulSoup(src, 'lxml')

    for div_tag in soup.find_all("div", class_="card-wrapper-container col-xs-12 col-sm-6 col-md-4"):
        for div1_tag in soup.find_all("div", class_="product-list-box card desktop-cart"):
            h2_tag = div1_tag.find("h2").find("span")
            price_tag = div1_tag.find("div", class_="price-info")
            l1 = [h2_tag.get_text(), price_tag.get_text()]
            urls.append(l1)

#print(urls)


with open('xyz.csv', 'wb') as myfile:
    wr = csv.writer(myfile)
    wr.writerows(urls)

The code above gets me a list of roughly 1200 product names and prices, of which only 30 to 40 are unique; the rest are duplicates. I want to get the data from all 25 pages uniquely, for a total of 486 unique products. I also tried using selenium to click the next-page link, but that did not work either.
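As an aside, the flood of duplicates in the code above comes from the inner loop: it calls soup.find_all over the whole document instead of searching inside the current card, so every product is appended once per card wrapper. A minimal sketch of that fix, assuming the markup used in the question (note it still cannot see past the first 20 server-rendered products, since the rest load dynamically):

for div_tag in soup.find_all("div", class_="card-wrapper-container col-xs-12 col-sm-6 col-md-4"):
    # search within this card only, not the whole soup
    div1_tag = div_tag.find("div", class_="product-list-box card desktop-cart")
    if div1_tag:
        h2_tag = div1_tag.find("h2").find("span")
        price_tag = div1_tag.find("div", class_="price-info")
        urls.append([h2_tag.get_text(), price_tag.get_text()])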

Best answer

This loops through all of the pages (including determining the number of pages) by issuing the same request the page itself makes, as seen in the browser's network tab. results is a list of lists that you can easily write to csv.

import requests, math, csv

page = '1'

def append_new_rows(data):
    for i in data:
        if 'name' in i:
            results.append([i['name'], i['final_price']])

with requests.Session() as s:
    r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
    results_per_page = 20
    total_results = r['response']['total_found']
    num_pages = math.ceil(total_results / results_per_page)
    results = []
    append_new_rows(r['response']['products'])

    for page in range(2, num_pages + 1):
        r = s.get(f'https://www.nykaa.com/gludo/products/list?pro=false&filter_format=v2&app_version=null&client=react&root=nav_3&page_no={page}&category_id=8397').json()
        append_new_rows(r['response']['products'])

with open("data.csv", "w", encoding="utf-8-sig", newline='') as csv_file:
    w = csv.writer(csv_file, delimiter=",", quoting=csv.QUOTE_MINIMAL)
    w.writerow(['Name', 'Price'])
    for row in results:
        w.writerow(row)
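As a quick sanity check on the paging arithmetic, the numbers from the question line up: 486 total products at 20 per page means the loop above will request 25 pages. A tiny sketch, assuming total_found comes back as 486:

import math

total_found = 486        # total product count quoted in the question (assumed)
results_per_page = 20    # products returned per API page
print(math.ceil(total_found / results_per_page))  # -> 25, matching the 25 pages mentioned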

Regarding python - scraping a dynamic website with beautifulsoup, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57732931/
