
python - How to scrape more data


I am trying to download all the diamonds they list on the following site: https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll

The plan is to grab the information and work out which one I would most like to buy (I will run some regressions to find out which diamonds are good value and pick my favourite).

To do this I wrote my first scraper. The problem is that it only seems to pick up the first 60 diamonds, not everything I can see on the site. Ideally I would like it to fetch all 100k+ diamonds across the different shapes (round, cushion, etc.). How can I get it to return all the data?

(I think this is because some rows only load after I scroll down, though I believe more than 60 rows load on the first request; and even if I scroll all the way to the bottom, the page only ever shows 1000.)
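A quick way to confirm how many rows the static HTML actually contains (a small diagnostic of my own, reusing the 'grid-row row ' selector from the scraper below):

import requests
from bs4 import BeautifulSoup

url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Count the result rows present in the raw HTML (no JavaScript executed)
rows = soup.find_all('a', class_='grid-row row ')
print(len(rows)) # ~60, far short of the 100k+ visible in the browser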

Here is my code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'

url_response = requests.get(url)
soup = BeautifulSoup(url_response.content, "html.parser")

""" Now we have the page as soup

Lets start to get the header"""

headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')

diamondsmessy = soup.find_all('a', class_='grid-row row ')
diamondscleaned = diamondsmessy[1].get_text(";")


"""Create diamonds dataframe with the header; take out the 1st value"""
header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)

""" place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""

for i in range(len(diamondsmessy)):
    a = diamondsmessy[i].get_text(";")
    b = a.split(";")
    del b[4]
    a = pd.DataFrame(b, index=header)
    b = a.transpose()
    diamonds = pd.concat([diamonds, b], ignore_index=True)

print(diamonds)

Best answer

I worked out how to do it. It is not fast, but essentially I needed Selenium to scroll down the page. I was still capped at 1000 rows, so I loop, raising the minimum price filter each pass to refresh the page.

To help anyone else, here is the code:

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

#for fun, let's time this
start = time.time()

"""Define important numbers"""

scroll_pause_time = 0.5 # delay after each scroll (seconds)
scroll_number = 20 # number of scrolls per page
pages_visited = 25 # number of times the minimum price is raised

"""Set up the website"""

url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'

url_response = webdriver.Firefox()
url_response.get(url)

#minimum & max carat:
min_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(1)')
min_carat.send_keys('0.8')
min_carat.send_keys(Keys.ENTER)

max_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(2)')
max_carat.send_keys('1.05')
max_carat.send_keys(Keys.ENTER)


#Shapes of diamonds:
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(2) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(4) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(5) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(7) > .shape-filter-button-inner').click()

"""Create diamonds dataframe with the header; take out the 1st value"""
soup = BeautifulSoup(url_response.page_source, "html.parser")

headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')

header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)

"""Start loop, dummy variable j"""
for j in range(pages_visited):

    print(j)
    url_response.execute_script("window.scrollTo(0, 0)") # back to the top, where the filters are

    # Raise the minimum price filter to the price of the last diamond collected
    if j != 0:
        min_price = url_response.find_element_by_css_selector('input[name="minValue"]')

        min_price.send_keys(Keys.CONTROL, "a")
        min_price.send_keys(Keys.DELETE)

        a = diamonds.loc[len(diamonds.count(1))-1, "Price"] # price of the most recent row
        a = a.replace('$', '')
        a = a.replace(',', '')
        min_price.send_keys(a)
        min_price.send_keys(Keys.ENTER)

    # Scroll down in steps, pausing so new rows can load
    for i in range(scroll_number):
        url_response.execute_script("window.scrollTo(0, "+str((i+1)*2000)+')')
        time.sleep(scroll_pause_time)

    # Grab data
    soup = BeautifulSoup(url_response.page_source, "html.parser")
    diamondsmessy = soup.find_all('a', class_='grid-row row ')

    """ place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""

    for i in range(len(diamondsmessy)):
        a = diamondsmessy[i].get_text(";")
        b = a.split(";")
        del b[4]
        a = pd.DataFrame(b, index=header)
        b = a.transpose()
        diamonds = pd.concat([diamonds, b], ignore_index=True)

diamonds = diamonds.drop_duplicates()
diamonds.to_csv('diamondsoutput.csv')

print(diamonds)

end = time.time()
print("This took "+ str(end-start)+" seconds")

Regarding python - How to scrape more data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53015772/
