
python - Scrapy won't crawl past the first page


Hi! Why doesn't the spider move on to the next pages? I'm using rules... What am I doing wrong? It only works on one page. Here is the code:

# -*- coding: utf-8 -*-

class JobSpider(CrawlSpider):
    name = 'superjob'
    allowed_domains = ['superjob.ru']
    start_urls = [
        'http://www.superjob.ru/vacancy/search/?t%5B0%5D=4&sbmit=1&period=7'
    ]

    rules = [
        Rule(SgmlLinkExtractor(allow='/vacancy/search/?',
                               restrict_xpaths=(
                                   u'//a[@class="h_border_none"]/<span>следующая</span>')),
             callback='parse',
             follow=True),
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select(
            '//*[@id="ng-app"]/div[2]/div/div[2]/div/div[1]/div[2]/div/div/h2/a')
        items = []
        for title in titles:
            item = JobItem()
            item['title'] = title.select('//h2/a/text()').extract()
            items.append(item)
        # return items

Best Answer

5 things to fix:

  • restrict_xpaths should point to the pagination block (a quick way to check this is sketched right after the list)
  • the callback should not be named parse(), because CrawlSpider uses parse() internally and overriding it breaks the rules
  • use LinkExtractor; SgmlLinkExtractor is deprecated
  • use xpath() instead of select(); there is a response.xpath() shortcut
  • fix the inner XPath expression: just get the text()
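
Before settling on the rule, it can help to confirm in a scrapy shell session that the extractor really picks up the pagination links. A minimal sketch, using the Paginator_navnums block from the fixed code below:

# Run first: scrapy shell "http://www.superjob.ru/vacancy/search/?t%5B0%5D=4&sbmit=1&period=7"
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow='/vacancy/search/\?',
                   restrict_xpaths='//div[@class="Paginator_navnums"]')

# Print the pagination URLs this rule would hand to the CrawlSpider
for link in le.extract_links(response):
    print(link.url)

If this prints nothing, the restrict_xpaths value does not match the page and the spider will never leave the first page.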

The fixed version:

# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class JobSpider(CrawlSpider):
    name = 'superjob'
    allowed_domains = ['superjob.ru']
    start_urls = [
        'http://www.superjob.ru/vacancy/search/?t%5B0%5D=4&sbmit=1&period=7'
    ]

    rules = [
        Rule(LinkExtractor(allow='/vacancy/search/\?',
                           restrict_xpaths=u'//div[@class="Paginator_navnums"]'),
             callback='parse_item',
             follow=True),
    ]

    def parse_item(self, response):
        titles = response.xpath(
            '//*[@id="ng-app"]/div[2]/div/div[2]/div/div[1]/div[2]/div/div/h2/a')
        for title in titles:
            item = JobItem()
            item['title'] = title.xpath('text()').extract()
            yield item
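
The answer, like the question, leaves out the import of JobItem; it is assumed to live in the project's items module. A minimal sketch of what that definition might look like, with only the title field used above (the file name and field set are assumptions):

# items.py - hypothetical minimal item definition for the spider above
import scrapy


class JobItem(scrapy.Item):
    # The only field the spider fills in
    title = scrapy.Field()

With the item in place, the spider can be run as usual, for example with scrapy crawl superjob -o jobs.json, and the rule should now follow every page linked from the pagination block.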

Regarding python - Scrapy won't crawl past the first page, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35539379/
