
python - Scraping an ASP.NET website with Scrapy through JavaScript buttons and AJAX requests

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:50:27

I have been trying to scrape some data from an ASP.NET website. The start page should be the following: http://www.e3050.com/Items.aspx?cat=SON

First, I want to display 50 items per page (chosen through the select element). Second, I want to paginate through the result pages.

I tried the following code to get 50 items per page, but it did not work:

start_urls = ["http://www.e3050.com/Items.aspx?cat=SON"]

def parse(self, response):
    requests = []
    hxs = HtmlXPathSelector(response)

    # Check if there's more than 1 page
    if len(hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()) > 0:
        # Get last page number
        last_page = hxs.select('//span[@id="ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_PageSize"]/text()').extract()[0]
        i = 1

        # preparing requests for each page
        while i < (int(last_page) / 5) + 1:
            requests.append(Request("http://www.e3050.com/Items.aspx?cat=SON", callback=self.parse_product))
            i += 1

    # posting form data (50 items and next page button)
    requests.append(FormRequest.from_response(
        response,
        formdata={'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                  '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01'},
        callback=self.parse_product,
        dont_click=True
    ))

    for request in requests:
        yield request

Best answer

Here is a complete solution.

The parse method selects 50 products per page.

The page_rs_50 method handles the pagination.

# Imports needed by the spider module (legacy selector API, as in the original answer).
from scrapy.http import FormRequest, Request
from scrapy.selector import HtmlXPathSelector

# The attributes and methods below belong inside the spider class.
start_urls = ['http://www.e3050.com/Items.aspx?cat=SON']
pro_urls = []  # all product URLs

def parse(self, response):  # select 50 products on each page
    yield FormRequest.from_response(
        response,
        formdata={
            'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
            'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)',
        },
        meta={'curr': 1, 'total': 0, 'flag': True},
        dont_click=True,
        callback=self.page_rs_50)

def page_rs_50(self, response):  # paginate the pages
    hxs = HtmlXPathSelector(response)
    curr = int(response.request.meta['curr'])
    total = int(response.request.meta['total'])
    flag = response.request.meta['flag']
    # Collect the product links found on the current page.
    self.pro_urls.extend(hxs.select(
        "//td[@class='name']//a[contains(@id,'ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_itemslv_ctrl')]/@href"
    ).extract())
    if flag:
        # First results page only: read the total from the pager label.
        # Convert to int so the comparison with curr below is numeric.
        total = int(hxs.select(
            "//span[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_lbl_pagesizeBtm']/text()").re(r'\d+')[0])
    if curr < total:
        curr += 1
        # Simulate a click on the pager's "next" button via an AJAX postback.
        yield FormRequest.from_response(
            response,
            formdata={
                'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pagesddl': '50',
                'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$sortddl': 'Price(ASC)',
                'ctl00$ctl00$ScriptManager1': 'ctl00$ctl00$ScriptManager1|ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                '__EVENTTARGET': 'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$pager1$ctl00$ctl01',
                'ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$hfVSFileName': hxs.select(
                    ".//input[@id='ctl00_ctl00_ContentPlaceHolder1_ItemListPlaceHolder_hfVSFileName']/@value").extract()[0],
            },
            meta={'curr': curr, 'total': total, 'flag': False},
            dont_click=True,
            callback=self.page_rs_50)
    else:
        # Last page reached: request every collected product page.
        for pro in self.pro_urls:
            yield Request("http://www.e3050.com/%s" % pro,
                          callback=self.parse_product)

def parse_product(self, response):
    # TODO: implement product parsing
    pass
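
The pagination request above packs several ASP.NET form fields into one dictionary and compares the current page against a count read from a label. As a rough sketch, those two pieces could be pulled into small pure-Python helpers so they can be checked without running a crawl. The names build_postback_formdata and parse_total_pages are hypothetical (they are not part of Scrapy or of the answer); the field names are copied from the answer, and the label text used in the usage note is an assumed example.

```python
import re

# Field-name prefix and ScriptManager id, copied from the answer above.
PREFIX = "ctl00$ctl00$ContentPlaceHolder1$ItemListPlaceHolder$"
SCRIPT_MANAGER = "ctl00$ctl00$ScriptManager1"


def build_postback_formdata(event_target, vs_file_name,
                            page_size="50", sort_order="Price(ASC)"):
    """Assemble the POST fields for one pager postback.

    Hypothetical helper: it only rebuilds the dictionary that the answer
    passes as formdata; it does not send any request itself.
    """
    return {
        PREFIX + "pagesddl": page_size,
        PREFIX + "sortddl": sort_order,
        # Same shape as in the answer: "<ScriptManager id>|<event target>".
        SCRIPT_MANAGER: "%s|%s" % (SCRIPT_MANAGER, event_target),
        "__EVENTTARGET": event_target,
        # Value of the hidden hfVSFileName input taken from the current page.
        PREFIX + "hfVSFileName": vs_file_name,
    }


def parse_total_pages(label_text):
    """Return the first run of digits in the label as an int, or 0 if none."""
    match = re.search(r"\d+", label_text)
    return int(match.group()) if match else 0
```

Inside page_rs_50, the result of build_postback_formdata(next_button_id, hidden_value) could be passed as formdata to FormRequest.from_response, and parse_total_pages("12") yields the integer 12, so a check such as curr < total compares numbers rather than a number and a string.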

Regarding python - scraping an ASP.NET website with Scrapy through JavaScript buttons and AJAX requests, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10218581/
