
python - Scrapy splash spider does not follow links to fetch new pages


I am scraping data from a page that uses Javascript links to navigate to new pages. I am using Scrapy + Splash to fetch this data, but for some reason the links are not being followed.
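Note: this assumes the standard scrapy-splash wiring in settings.py, roughly as in the scrapy-splash README (the SPLASH_URL value is an assumption for a locally running Splash instance):

# Standard scrapy-splash setup in settings.py (per the scrapy-splash README);
# the URL below assumes a Splash instance running locally on port 8050.
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Splash-aware dupefilter recommended by scrapy-splash, so identical URLs
# with different Splash args are treated as distinct requests
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'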

Here is the code for my spider:

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash, args)
    local javascript = args.javascript
    assert(splash:runjs(javascript))
    splash:wait(0.5)

    return {
        html = splash:html()
    }
end
"""

page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=0&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"


class MySpider(scrapy.Spider):
    name = "foo_crawler"
    download_delay = 5.0

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        #'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'
    }

    def start_requests(self):
        yield SplashRequest(url=page_url,
                            callback=self.parse)

    # Parses the first page, then follows the "Next" link to subsequent pages
    def parse(self, response):
        try:
            self.extract_data_from_page(response)

            href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href')
            print("href: {0}".format(href))

            if href:
                # The "Next" link is a javascript: URL; keep only the code after the scheme
                javascript = href.extract_first().split(':')[1].strip()

                yield SplashRequest(response.url, self.parse,
                                    cookies={'store_language': 'en'},
                                    endpoint='execute',
                                    args={'lua_source': script, 'javascript': javascript})

        except Exception as err:
            print("The following error occurred: {0}".format(err))

    def extract_data_from_page(self, response):
        url = response.url
        page_num = url.split('page=')[1].split('&')[0]
        print("extract_data_from_page() called on page: {0}.".format(url))
        filename = "page_{0}.html".format(page_num)
        with open(filename, 'w') as f:
            f.write(response.text)

    def handle_error(self, failure):
        print("Error: {0}".format(failure))
Only the first page is scraped; I cannot "click" through the links at the bottom of the page to get the subsequent pages.

How can I fix this so that I can click through the pages listed at the bottom of the page?

Best Answer

Your code looks fine. The only problem is that the requests you yield all have the same URL, so the duplicate filter ignores them. Just uncomment DUPEFILTER_CLASS and try again:

custom_settings = {
    ...
    'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
}
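If you would rather keep the default duplicate filter enabled for the rest of the crawl, Scrapy also allows exempting individual requests. A minimal sketch, assuming everything else in parse stays the same: SplashRequest forwards the standard dont_filter flag to scrapy.Request, so only the same-URL pagination request bypasses the filter.

yield SplashRequest(response.url, self.parse,
                    cookies={'store_language': 'en'},
                    endpoint='execute',
                    # Bypass the dupefilter for this request only, instead of
                    # disabling duplicate filtering globally
                    dont_filter=True,
                    args={'lua_source': script, 'javascript': javascript})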

Edit: to page through the data without running the javascript, you can do it like this:

import re

page_url = "https://www.londonstockexchange.com/exchange/prices-and-markets/stocks/exchange-insight/trade-data.html?page=%s&pageOffBook=0&fourWayKey=GB00B6774699GBGBXAMSM&formName=frmRow&upToRow=-1"


class MySpider(scrapy.Spider):
    # Pulls the target page number out of the javascript: href
    page_number_regex = re.compile(r"'frmRow',(\d+),")
    ...

    def start_requests(self):
        yield SplashRequest(url=page_url % 0,
                            callback=self.parse)

    ...
            if href:
                javascript = href.extract_first().split(':')[1].strip()
                matched = re.search(self.page_number_regex, javascript)
                if matched:
                    # Request the next page by URL directly, so each request has
                    # a distinct url and the duplicate filter no longer drops it
                    yield SplashRequest(page_url % matched.group(1), self.parse,
                                        cookies={'store_language': 'en'},
                                        endpoint='execute',
                                        args={'lua_source': script, 'javascript': javascript})
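To make the regex concrete: the href on the "Next" link is a javascript: URL whose function call carries the target page number as an argument. A small self-contained check; the function name setContentPos below is hypothetical, since only the 'frmRow',<number>, fragment matters to the pattern:

import re

page_number_regex = re.compile(r"'frmRow',(\d+),")

# Hypothetical example of the extracted javascript; the real page's function
# name may differ, the regex only keys on the 'frmRow',<digits>, fragment
javascript = "setContentPos('frmRow',2,0);"
matched = re.search(page_number_regex, javascript)
assert matched and matched.group(1) == "2"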

Although I was expecting a solution that uses the javascript.

Regarding "python - Scrapy splash spider does not follow links to fetch new pages", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54867680/
