
python - I am trying to scrape data, but it only fetches 10 pages of data while there are 26 pages

Reposted. Author: 行者123. Updated: 2023-12-01 09:31:25

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list")

c = r.content

soup = BeautifulSoup(c, "html.parser")

all = soup.find_all("div", {"class": "col _2-gKeQ"})

page_nr = soup.find_all("a", {"class": "_33m_Yg"})[-1].text
print(page_nr, "number of pages were found")

# all[0].find("div", {"class": "_1vC4OE _2rQ-NK"}).text

l = []
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page=1&q=laptop&sid=6bo%2Fb5g&viewType=list"
for page in range(0, int(page_nr) * 10, 10):
    print()
    r = requests.get(base_url + str(page) + ".html")
    c = r.content
    # c = r.json()["list"]
    soup = BeautifulSoup(c, "html.parser")

    for item in all:
        d = {}
        # price
        d["Price"] = item.find("div", {"class": "_1vC4OE _2rQ-NK"}).text
        # Name
        d["Name"] = item.find("div", {"class": "_3wU53n"}).text

        for li in item.find_all("li", {"class": "_1ZRRx1"}):
            if " EMI" in li.text:
                d["EMI"] = li.text
            else:
                d["EMI"] = None

        for li1 in item.find_all("li", {"class": "_1ZRRx1"}):
            if "Special " in li1.text:
                d["Special Price"] = li1.text
            else:
                d["Special Price"] = None

        for val in item.find_all("li", {"class": "tVe95H"}):
            if "Display" in val.text:
                d["Display"] = val.text
            elif "Warranty" in val.text:
                d["Warranty"] = val.text
            elif "RAM" in val.text:
                d["Ram"] = val.text

        l.append(d)

import pandas
df = pandas.DataFrame(l)

Best Answer

This may work for standard pagination:

i = 1
items_parsed = set()
loop = True
base_url = "https://www.flipkart.com/search?as=on&as-pos=1_1_ic_lapto&as-show=on&otracker=start&page={}&q=laptop&sid=6bo%2Fb5g&viewType=list"
while True:
    page = requests.get(base_url.format(i))
    soup = BeautifulSoup(page.content, "html.parser")
    items = soup.find_all(#yourelements#)
    if not items:
        break
    for item in items:
        # Scrape the item and, once the scrape succeeds, put the URL of the
        # parsed item into url_parsed (details below the code), for example:
        url_parsed = your_stuff(item)
        if url_parsed in items_parsed:
            loop = False
        items_parsed.add(url_parsed)
    if not loop:
        break
    i += 1

I formatted your URL as ?page=X with base_url.format(i), so it can iterate until no items are found on a page, or until the site wraps back to page 1, which sometimes happens when you request max_page + 1.

If, past the last page, you get items that were already parsed on the first page, you can declare a set() and store the URL of every item you parse, then check whether an item has already been parsed before continuing.

Note that this is just a sketch of the idea.
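To make the stop conditions concrete, here is a minimal, self-contained sketch of that pagination loop with the page-fetching step injected as a function, so the logic can be demonstrated without hitting the network. The names scrape_all_pages, fetch_page, and the fake page data are hypothetical, not part of the original answer:

```python
def scrape_all_pages(fetch_page, max_pages=100):
    """Request page 1, 2, ... and stop when a page yields no items,
    or when every item on a page was already seen (the site wrapped
    back to page 1). fetch_page(n) returns a list of (url, data) pairs."""
    items_parsed = set()   # URLs of items already scraped
    results = []
    for page_no in range(1, max_pages + 1):
        items = fetch_page(page_no)
        if not items:
            break          # empty page: we are past the last page
        new_items = [(u, d) for u, d in items if u not in items_parsed]
        if not new_items:
            break          # only repeats: the site wrapped around
        for url, data in new_items:
            items_parsed.add(url)
            results.append(data)
    return results

# Fake site with 3 pages; any later page wraps back to page 1's items.
pages = {1: [("u1", "a"), ("u2", "b")], 2: [("u3", "c")], 3: [("u4", "d")]}
fake_fetch = lambda n: pages.get(n, pages[1])
print(scrape_all_pages(fake_fetch))  # -> ['a', 'b', 'c', 'd']
```

In a real scrape, fetch_page would do the requests.get + BeautifulSoup parsing and return one (item URL, item dict) pair per result, so the loop no longer depends on the page count printed on page 1.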

Regarding "python - I am trying to scrape data, but it only fetches 10 pages of data while there are 26 pages", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49939098/
