
selenium - Scraping content from an infinite scroll website


I am trying to scrape links from a web page that loads content through infinite scrolling. I can only get the links on the first pane. How do I continue so that I end up with a complete list of all the links? This is what I have so far -


from bs4 import BeautifulSoup
import requests

html = "https://www.carwale.com/used/cars-for-sale/#sc=-1&so=-1&car=7&pn=8&lcr=168&ldr=0&lir=0"
html_content = requests.get(html).text
soup = BeautifulSoup(html_content, "lxml")
table = soup.find_all("div", {"class": "card-detail-block__data"})

# Collect the detail-page links from the cards on the first pane
y = []
for i in table:
    try:
        y.append(i.find("a", {"id": "linkToDetails"}).get('href'))
    except AttributeError:
        pass

z = ['carwale.com' + item for item in y]
z

Best Answer

You don't need BeautifulSoup at all to wrestle with the HTML DOM, because the site serves the JSON responses that populate the HTML; requests alone can do the job. If you monitor the Network tab in the Chrome or Firefox dev tools, you will see that for each scroll load the browser sends a GET request to an API. From that we can get clean JSON data.

Disclaimer: I have not checked whether the site allows web scraping. Please double-check their terms of use; I assume you have done so.

I use pandas to help handle the tabular data and to export it as CSV or any other format you prefer: pip install pandas

import pandas as pd
from requests import Session

# Using Session and a header
req = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36',
           'Content-Type': 'application/json;charset=UTF-8'}
# Add headers
req.headers.update(headers)

BASE_URL = 'https://www.carwale.com/webapi/classified/stockfilters/'

# Monitoring the requests in the Network tab, the params change on each load:
#sc=-1&so=-1&car=7&pn=1
#sc=-1&so=-1&car=7&pn=2&lcr=24&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=3&lcr=48&ldr=0&lir=0
#sc=-1&so=-1&car=7&pn=4&lcr=72&ldr=0&lir=0

params = dict(sc=-1, so=-1, car=7, pn=4, lcr=72, ldr=0, lir=0)

r = req.get(BASE_URL, params=params) #just like requests.get

# Check if everything is okay
assert r.ok, 'We did not get 200'

# get json data
data = r.json()

# Put it in DataFrame
df = pd.DataFrame(data['ResultData'])

print(df.head())

# To go to another page, create a function:

def scrap_carwale(params):
    r = req.get(BASE_URL, params=params)
    if not r.ok:
        raise ConnectionError('We did not get 200')
    data = r.json()

    return pd.DataFrame(data['ResultData'])


# Just the first 5 pages :)
for i in range(5):
    params['pn'] += 1
    params['lcr'] += 24  # per the request pattern above, lcr grows by 24 per page

    dt = scrap_carwale(params)
    # Append this page's data (DataFrame.append was removed in pandas 2.0)
    df = pd.concat([df, dt], ignore_index=True)

# Print a data sample
print(df.sample(10))

# Save data to CSV or whatever format you like
df.to_csv('my_data.csv')  # see df.to_?
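
If you want more than a fixed number of pages, here is a minimal sketch of a stop condition. The function name scrap_all_pages is mine, and it rests on two unverified assumptions: that an exhausted page simply comes back with an empty ResultData list, and that lcr grows by the page size of 24 seen in the requests above.

# Sketch: keep paging until the API stops returning rows.
# Assumptions (not confirmed against the site): an exhausted page returns an
# empty 'ResultData' list, and 'lcr' grows by the page size of 24.
def scrap_all_pages(start_params, page_size=24, max_pages=200):
    frames = []
    params = dict(start_params)   # work on a copy
    for _ in range(max_pages):    # hard cap as a safety net
        r = req.get(BASE_URL, params=params)
        if not r.ok:
            break
        rows = r.json().get('ResultData', [])
        if not rows:              # nothing left to load
            break
        frames.append(pd.DataFrame(rows))
        params['pn'] += 1
        params['lcr'] += page_size
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

all_cars = scrap_all_pages(dict(sc=-1, so=-1, car=7, pn=1, lcr=0, ldr=0, lir=0))
print(len(all_cars))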

Here is the Network tab: [screenshot]

Response: [screenshot]

Sample of the results: [screenshot]
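
To close the loop on the original question (a complete list of the detail-page links): each ResultData record should carry the relative link somewhere, but the column name below is only a guess, so inspect df.columns (or one raw record) first.

# Sketch only: 'url' is a hypothetical column name for the relative link --
# check df.columns or data['ResultData'][0] for the real field before using this.
link_col = 'url'
if link_col in df.columns:
    links = ('https://www.carwale.com' + df[link_col].astype(str)).tolist()
    print(len(links), links[:5])
else:
    print('Available fields:', list(df.columns))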

Regarding "selenium - Scraping content from an infinite scroll website", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60237614/
