python - Unable to get certain tags from a lazy-loading website

Reposted. Author: 行者123. Updated: 2023-12-01 03:01:44

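When I run my scraper, I can see that it grabs unnecessary links in addition to the link for each school that I need, even though I have written the correct XPath. The site uses lazy loading, so perhaps I need to scrape the JSON response instead. Here is what I tried: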

import requests
from lxml import html

url = "http://www.boarding.org.au/find-a-school"
def LazyLoadWeb(address):
    try:
        page = requests.get(address, timeout=30)
    except Exception:
        print('timed out')
    else:
        tree = html.fromstring(page.text)
        titles = tree.xpath('//div[contains(@class,"clearfix")]')
        for title in titles:
            links = title.xpath('.//a/@href')
            for link in links:
                print(link)

LazyLoadWeb(url)

Best answer

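You are right about the JSON response. The site uses Ajax to populate its content, so the school links are not in the HTML that the initial request returns. You need to send a POST request and simply parse the JSON from the response.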

import requests

url = 'http://www.boarding.org.au/ajax-calls/GetSchoolsJson'
payload = {"state": 'null', "schoolType": 2, "orderMode": "ASC", "enableSchoolType": 'false', "loadAll": 'true'}
req = requests.post(url, json=payload)
data = req.json()
for i, item in enumerate(data, start=1):
    print(i, item['URL'])
# 1 /schools/details/4/Abbotsleigh
# ...
# 189 /schools/details/83/Yirara-College
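
The URLs printed above are site-relative, so they may need to be joined to the site root before they can be requested. The following is a minimal sketch of that step, assuming each item is a dict with a 'URL' key as in the output above. The names absolute_school_links and BASE_URL are hypothetical, and the sample list is hand-written from the two paths shown above rather than taken from a live response.

```python
from urllib.parse import urljoin

# Assumed site root; not stated explicitly in the answer.
BASE_URL = "http://www.boarding.org.au/"


def absolute_school_links(items, base=BASE_URL):
    """Return absolute URLs for every item that has a non-empty 'URL' value."""
    links = []
    for item in items:
        path = item.get("URL")
        if path:  # skip entries with a missing or empty URL
            links.append(urljoin(base, path))
    return links


# Hand-written sample shaped like the parsed JSON; the last entry has no URL.
sample = [
    {"URL": "/schools/details/4/Abbotsleigh"},
    {"URL": "/schools/details/83/Yirara-College"},
    {"Name": "entry without a URL"},
]

for link in absolute_school_links(sample):
    print(link)
```

With a live response, the same function could be called as absolute_school_links(data). Using item.get rather than item['URL'] is a defensive choice in case some entries lack the key; whether that ever happens on this site is not confirmed by the answer.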

For more on python - Unable to get certain tags from a lazy-loading website, see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/43744182/
