
javascript - Scrapy only scrapes the first two pages

Reposted · Author: 行者123 · Updated: 2023-12-04 10:36:34

I am trying to scrape a website, and because its content is created dynamically, I need to use Splash on every page. Right now it only renders the first 2 pages, even though there are 47 in total.

Here is the code:

import scrapy
from scrapy.http import Request
from scrapy_splash import SplashRequest

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    start_urls = ['https://jobs.citizensbank.com/search-jobs']

    def start_requests(self):
        filters_script = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)
            return splash:html()
        end"""

        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={'lua_source': filters_script})

    def parse(self, response):
        cars_urls = response.xpath('.//section[@id="search-results-list"]/ul/li/a/@href').extract()
        for car_url in cars_urls:
            absolute_car_url = response.urljoin(car_url)
            yield scrapy.Request(absolute_car_url,
                                 callback=self.parse_car)

        script_at_page_1 = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)

            next_button = splash:select("a[class=next]")
            next_button.mouse_click()
            splash:wait(3)
            return {
                url = splash:url(),
                html = splash:html()
            }
        end"""

        script_at_page_2 = """function main(splash)
            assert(splash:go(splash.args.url))
            splash:wait(3)

            next_button = splash:select("a[class=next]")
            next_button.mouse_click()
            splash:wait(3)
            return {
                url = splash:url(),
                html = splash:html()
            }
        end"""

        script = None
        if response.url is not self.start_urls[0]:
            script = script_at_page_2
        else:
            script = script_at_page_1

        yield SplashRequest(url=response.url,
                            callback=self.parse,
                            endpoint='execute',
                            args={'lua_source': script})

    def parse_car(self, response):
        jobtitle = response.xpath('//h1[@itemprop="title"]/text()').extract_first()
        location = response.xpath('//span[@class="job-info"]/text()').extract_first()
        jobid = response.xpath('//span[@class="job-id job-info"]/text()').extract_first()

        yield {'jobtitle': jobtitle,
               'location': location,
               'jobid': jobid}

I have played with it every way I can think of, but nothing has worked.
I am new to Scrapy, so any help is appreciated.

Best answer

I don't think you need Splash for this. If you look at the Network tab in your browser's inspector, you will see that, under XHR, the page makes a request to this URL:

https://jobs.citizensbank.com/search-jobs/results?ActiveFacetID=0&CurrentPage=3&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&Latitude=&Longitude=&ShowRadius=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&CategoryFacetTerm=&CategoryFacetType=&LocationFacetTerm=&LocationFacetType=&KeywordType=&LocationType=&LocationPath=&OrganizationIds=&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=

Try making requests to this URL, changing the page number each time. If you run into problems, you may need to look at the headers of the XHR request and copy them. If you click the link, the JSON loads directly in your browser. So just set page 1 as your start_url and override start_requests like this:

start_urls = ['https://jobs.citizensbank.com/search-jobs/results?ActiveFacetID=0&CurrentPage={}&RecordsPerPage=15&Distance=50&RadiusUnitType=0&Keywords=&Location=&Latitude=&Longitude=&ShowRadius=False&CustomFacetName=&FacetTerm=&FacetType=0&SearchResultsModuleName=Search+Results&SearchFiltersModuleName=Search+Filters&SortCriteria=0&SortDirection=0&SearchType=5&CategoryFacetTerm=&CategoryFacetType=&LocationFacetTerm=&LocationFacetType=&KeywordType=&LocationType=&LocationPath=&OrganizationIds=&PostalCode=&fc=&fl=&fcf=&afc=&afl=&afcf=']

def start_requests(self):
    num_pages = 10
    for page in range(1, num_pages):
        yield scrapy.Request(self.start_urls[0].format(page), callback=self.parse)
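If embedding `{}` in that very long query string feels brittle, one alternative (a sketch of my own, not part of the original answer, using only the standard library) is to rewrite the `CurrentPage` parameter with `urllib.parse`. The `BASE` URL below is abbreviated to a few of the real query parameters for readability:

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

# Abbreviated version of the real search-jobs/results URL, for illustration.
BASE = 'https://jobs.citizensbank.com/search-jobs/results?CurrentPage=1&RecordsPerPage=15&SearchType=5'

def page_url(base, page, per_page=None):
    """Return `base` with CurrentPage (and optionally RecordsPerPage) replaced."""
    parts = urlsplit(base)
    # keep_blank_values preserves the many empty parameters (Keywords=, Location=, ...)
    params = {k: v[0] for k, v in parse_qs(parts.query, keep_blank_values=True).items()}
    params['CurrentPage'] = str(page)
    if per_page is not None:
        params['RecordsPerPage'] = str(per_page)
    return urlunsplit(parts._replace(query=urlencode(params)))

print(page_url(BASE, 3))  # same URL, with CurrentPage=3
```

This keeps the rest of the query string intact no matter which parameter you vary, so the same helper also serves for experimenting with `RecordsPerPage`.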

It is also worth noting that you can set the RecordsPerPage parameter. You could set it higher and possibly get all the records in one page, or at least reduce the number of requests needed to fetch them all.
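To gauge the saving: the question mentions 47 pages at the default 15 records per page, i.e. roughly 705 postings, so the request count for a given RecordsPerPage is just a ceiling division (a back-of-the-envelope sketch, not anything the site's API reports):

```python
import math

TOTAL_RECORDS = 47 * 15  # ~705 jobs implied by 47 pages of 15

def requests_needed(total, per_page):
    """Number of paged requests needed to cover `total` records."""
    return math.ceil(total / per_page)

print(requests_needed(TOTAL_RECORDS, 15))   # 47 requests at the default page size
print(requests_needed(TOTAL_RECORDS, 100))  # 8 requests with RecordsPerPage=100
```

Whether the server honors an arbitrary RecordsPerPage value is something to verify against the live endpoint.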

Regarding "javascript - Scrapy only scrapes the first two pages", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/60153617/
