python - How to scrape the invisible page data of the following link (i.e. page Nos. 11, 12, 13); the code below only works up to page 10


from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import csv

page = urlopen("https://www.flipkart.com/mobiles/samsung~brand/pr?count=40&p%5B%5D=sort%3Drecency_desc&page=1&sid=tyy%2F4io&viewType=list&wid=1.productCard.PMU_V2")
bsObj = BeautifulSoup(page, 'html.parser')

# List to store next page URLs.
nxtPageLink = []

# Extraction of next page URLs from the pagination bar.
for nxtLink in bsObj.findAll(class_="_33m_Yg"):
    completeUrl = "https://www.flipkart.com" + nxtLink.attrs['href']
    nxtPageLink.append(completeUrl)

# List to store scraped product URLs.
URL = []

# Extraction of product URLs from each collected page.
for i in nxtPageLink:
    url = urlopen(i)
    bs = BeautifulSoup(url, 'html.parser')

    for link in bs.findAll(class_="_1UoZlX"):
        urlBuild = "https://www.flipkart.com" + link.attrs['href']
        URL.append(urlBuild)

columnsTitles = ['Link']
test_df = pd.DataFrame({'Link': URL})
pd.set_option('display.max_colwidth', 0)
print(test_df.info())
test_df

Here I am trying to scrape the product URLs from all 13 pages, but I can only get data from 10 pages... please help me.

Best Answer

That is because not all page numbers are exposed on the first page.
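
A quick way to confirm this is to count how many pagination links the first page actually exposes; in the question's code that is the _33m_Yg class. A minimal diagnostic sketch, assuming Flipkart still serves the same markup as in the question:

from urllib.request import urlopen
from bs4 import BeautifulSoup

firstPage = urlopen("https://www.flipkart.com/mobiles/samsung~brand/pr?count=40&p%5B%5D=sort%3Drecency_desc&page=1&sid=tyy%2F4io&viewType=list&wid=1.productCard.PMU_V2")
bs = BeautifulSoup(firstPage, 'html.parser')

# Only the page links rendered in the pagination bar are counted here,
# which is why the original loop never sees pages 11-13.
print(len(bs.findAll(class_="_33m_Yg")))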

Instead, the scraper should keep collecting the data on the current page and then open the next page, until there are no more pages.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
import csv

# List to store scraped product URLs.
URL = []

# Start page's URL.
pageUrl = "https://www.flipkart.com/mobiles/samsung~brand/pr?count=40&p%5B%5D=sort%3Drecency_desc&page=1&sid=tyy%2F4io&viewType=list&wid=1.productCard.PMU_V2"

while True:
    page = urlopen(pageUrl)
    bsObj = BeautifulSoup(page, 'html.parser')

    # Extraction of product URLs from the current page.
    for link in bsObj.findAll(class_="_1UoZlX"):
        urlBuild = "https://www.flipkart.com" + link.attrs['href']
        URL.append(urlBuild)

    # Look for the "Next" link; if there is none, this is the last page.
    nxtLink = bsObj.find(class_="_2kUstJ", text="Next")
    if nxtLink is None:
        break

    # Build the next page's URL.
    pageUrl = "https://www.flipkart.com" + nxtLink.a.attrs['href']

columnsTitles = ['Link']
test_df = pd.DataFrame({'Link': URL})
pd.set_option('display.max_colwidth', 0)
print(test_df.info())
test_df

In this case, test_df contains 301 rows:

print(test_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 301 entries, 0 to 300
Data columns (total 1 columns):
Link 301 non-null object
dtypes: object(1)
memory usage: 2.4+ KB
None
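
As a side note, the listing URL already carries a page= query parameter, so an alternative to following the "Next" button is to increment that parameter until a page returns no product cards. This is only a sketch under the assumption that every listing page keeps using the _1UoZlX class for product links; it is a variant, not the answer's method:

from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = []
pageNo = 1
# Same listing URL as above, with the page number left as a placeholder.
baseUrl = ("https://www.flipkart.com/mobiles/samsung~brand/pr?count=40"
           "&p%5B%5D=sort%3Drecency_desc&sid=tyy%2F4io&viewType=list"
           "&wid=1.productCard.PMU_V2&page={}")

while True:
    bs = BeautifulSoup(urlopen(baseUrl.format(pageNo)), 'html.parser')
    links = bs.findAll(class_="_1UoZlX")
    # Stop as soon as a page has no product cards left.
    if not links:
        break
    for link in links:
        URL.append("https://www.flipkart.com" + link.attrs['href'])
    pageNo += 1

Either way, test_df.to_csv('flipkart_samsung_links.csv', index=False) (the filename here is only an example) would persist the collected links instead of just printing the DataFrame summary.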

Regarding "python - How to scrape the invisible page data of the following link (i.e. page Nos. 11, 12, 13); the code below only works up to page 10", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50036543/
