python - Trouble scraping data that spans multiple pages

Reposted. Author: 太空宇宙. Updated: 2023-11-03 14:23:23

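I have written a script in Python to fetch data from a web page. The site displays its content across 60 pages. My scraper can parse the data from the second page, but as soon as I change the page number in the payload parameters, or write a loop to fetch several pages, it breaks. How can I fix my script so that it fetches data from all the pages rather than only the second one? Thanks in advance.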

  1. Link to the site that contains the data: Page_link
  2. Link to substitute into the script below: page_url

I think the page number is set here:

ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages:1

Here is the full script (it only works for page 2):

import requests
from bs4 import BeautifulSoup

url = "Link to replace with the above url"  # replace with link 2 (page_url) from the list above

formdata = {
'searchEntity':'FundServiceProvider',
'searchType':'Name',
'searchText':'',
'registers':'6,29,44,45',
'AspxAutoDetectCookieSupport':'1'
}
req = requests.get(url,params=formdata,headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(req.text,"lxml")

VIEWSTATE = soup.select("#__VIEWSTATE")[0]['value']
EVENTVALIDATION = soup.select("#__EVENTVALIDATION")[0]['value']

payload = {
'__EVENTTARGET':'',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':VIEWSTATE,
'__SCROLLPOSITIONX':'0',
'__SCROLLPOSITIONY':'541',
'__EVENTVALIDATION':EVENTVALIDATION,
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
}

with requests.session() as session:
    session.headers = {"User-Agent":"Mozilla/5.0"}
    response = session.post(req.url,data=payload)
    soup = BeautifulSoup(response.text,"lxml")
    tabd = soup.select(".searchresults")[0]
    for items in tabd.select("tr")[:-1]:
        data = ' '.join([item.text for item in items.select("th,td")])
        print(data)

Best answer

You just need to remove the last 2 fields from the payload data:

payload = {
'__EVENTTARGET':'',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':VIEWSTATE,
'__SCROLLPOSITIONX':'0',
'__SCROLLPOSITIONY':'541',
'__EVENTVALIDATION':EVENTVALIDATION,
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1
}

instead of

payload = {
'__EVENTTARGET':'',
'__EVENTARGUMENT':'',
'__LASTFOCUS':'',
'__VIEWSTATE':VIEWSTATE,
'__SCROLLPOSITIONX':'0',
'__SCROLLPOSITIONY':'541',
'__EVENTVALIDATION':EVENTVALIDATION,
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages':1,
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.x':'260',
'ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$btnNext.y':'11'
}

Then update the value of ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages to the page you want, and the response will contain that page's data.
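
A loop over the page numbers could look roughly like the sketch below. It is only an illustration under some assumptions: the names HiddenFields, read_state and fetch_pages are made up for this example, the hidden fields are read with the standard library's html.parser as a stand-in for the BeautifulSoup selectors used in the question, and session can be any object with a requests-style post(url, data=...) method. ASP.NET pages typically issue fresh __VIEWSTATE and __EVENTVALIDATION values with every response, so the sketch re-reads them from the most recent page; whether this particular site requires that is not confirmed here.

```python
from html.parser import HTMLParser

PAGE_FIELD = "ctl00$cphRegistersMasterPage$gvwSearchResults$ctl18$ddlPages"


class HiddenFields(HTMLParser):
    """Collect the value of every <input> that has an id (e.g. __VIEWSTATE)."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and "id" in attrs:
            self.values[attrs["id"]] = attrs.get("value") or ""


def read_state(html):
    """Return (__VIEWSTATE, __EVENTVALIDATION) parsed from a page's HTML."""
    parser = HiddenFields()
    parser.feed(html)
    return parser.values["__VIEWSTATE"], parser.values["__EVENTVALIDATION"]


def fetch_pages(session, url, first_html, pages):
    """Yield (page, html) for each requested page number.

    The payload is the trimmed one from the answer (no btnNext.x / btnNext.y);
    only the ddlPages value changes from request to request.
    """
    html = first_html
    for page in pages:
        # Assumption: the state fields from the latest response are the ones to send.
        viewstate, eventvalidation = read_state(html)
        payload = {
            "__EVENTTARGET": "",
            "__EVENTARGUMENT": "",
            "__LASTFOCUS": "",
            "__VIEWSTATE": viewstate,
            "__SCROLLPOSITIONX": "0",
            "__SCROLLPOSITIONY": "541",
            "__EVENTVALIDATION": eventvalidation,
            PAGE_FIELD: page,
        }
        html = session.post(url, data=payload).text
        yield page, html
```

Inside the question's with requests.session() block this might be driven as for page, html in fetch_pages(session, req.url, req.text, range(1, 61)), parsing each html with BeautifulSoup exactly as before. The range 1 to 60 is an assumption based on the 60 pages mentioned in the question.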

For python - trouble scraping data that spans multiple pages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47796867/
