
python - How do I save scraping results from multiple website pages to a CSV file?


I am trying to scrape some ASINs (say, 600 of them) from the Amazon website using selenium and beautifulsoup — ASINs only. My main problem: how do I save all of the scraped data to a CSV file? I have tried a few things, but the file only ever contains the last scraped page.
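A common reason only the last page ends up in the CSV is reopening the file in write mode ('w') inside the loop, which truncates it on every iteration. A minimal sketch of the usual fix is to open the file once, before the loop; the rows list below is a placeholder for one page's scraped results, not part of the original code:

import csv

with open('asin_data.csv', 'w', newline='') as fd:  # opened once, not per page
    writer = csv.writer(fd)
    for page in range(1, 4):
        rows = [[page, 'B000EXAMPLE']]  # placeholder for this page's scraped ASINs
        writer.writerows(rows)  # rows accumulate; nothing is overwritten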

Here is the code:

from time import sleep
import requests
import time
import json
import re
import sys
import numpy as np
from selenium import webdriver
import urllib.request
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup


data_record = []  # collected results, one list of ASINs per page

i = 1
while True:
    try:
        if i == 1:
            url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page=1"
        else:
            url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page={}".format(i)
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')

        # print page url
        print(url)

        # rest of the scraping code
        driver = webdriver.Chrome()
        driver.get(url)

        HTML = driver.page_source
        HTML1 = driver.page_source
        soup = BeautifulSoup(HTML1, "html.parser")
        styles = soup.find_all(name="div", attrs={"data-asin": True})
        res1 = [i.attrs["data-asin"] for i in soup.find_all("div") if i.has_attr("data-asin")]
        print(res1)
        data_record.append(res1)
        # driver.close()

        # don't overflow website
        sleep(1)

        # increase page number
        i += 1
        if i == 3:
            print("STOP!!!")
            break
    except:
        break



Best Answer

Removing the items that currently appear to be unused may be the solution:

import csv
import bs4
import requests
from selenium import webdriver
from time import sleep


def retrieve_asin_from(base_url, idx):
    url = base_url.format(idx)
    r = requests.get(url)
    soup = bs4.BeautifulSoup(r.content, 'html.parser')

    with webdriver.Chrome() as driver:
        driver.get(url)
        HTML1 = driver.page_source
        soup = bs4.BeautifulSoup(HTML1, "html.parser")
        res1 = [i.attrs["data-asin"]
                for i in soup.find_all("div") if i.has_attr("data-asin")]
    sleep(1)
    return res1


url = "https://www.amazon.es/s?k=doll&i=toys&rh=n%3A599385031&dc&page={}"
data_record = [retrieve_asin_from(url, i) for i in range(1, 4)]

combined_data_record = combine_records(data_record)  # fcn to write

with open('asin_data.csv', 'w', newline='') as fd:
    csvfile = csv.writer(fd)
    csvfile.writerows(combined_data_record)
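
The answer leaves combine_records as a function for the reader to write. A minimal sketch, assuming the goal is one CSV row per ASIN with its page number; the header names and row layout are illustrative assumptions, not part of the answer:

def combine_records(data_record):
    # data_record is a list of per-page ASIN lists; flatten it into
    # one row per ASIN, keeping the 1-based page number next to it.
    rows = [["page", "asin"]]  # assumed header row
    for page_no, asins in enumerate(data_record, start=1):
        for asin in asins:
            rows.append([page_no, asin])
    return rows

csv.writer.writerows accepts any iterable of row lists, so if a flat single-column file is preferred instead, returning [[asin] for asins in data_record for asin in asins] works just as well.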

Regarding python - How do I save scraping results from multiple website pages to a CSV file?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59750971/
