
python - How do I go to the next page in Scrapy rules

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:45:41

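I have set up a rule to get the next page from the start_url, but it does not work: it only crawls the start_urls page and the links on that page (using parseLinks). It never goes to the next page set in the rule.

Any help?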

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy import log
from urlparse import urlparse
from urlparse import urljoin
from scrapy.http import Request

class MySpider(CrawlSpider):
    name = 'testes2'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/pesquisa/filtro/?tipo=0&local=0'
    ]

    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//a[@id="seguinte"]/@href')), follow=True),)

    def parse(self, response):
        sel = Selector(response)
        urls = sel.xpath('//div[@id="btReserve"]/../@href').extract()
        for url in urls:
            url = urljoin(response.url, url)
            self.log('URLS: %s' % url)
            yield Request(url, callback = self.parseLinks)

    def parseLinks(self, response):
        sel = Selector(response)
        titulo = sel.xpath('h1/text()').extract()
        morada = sel.xpath('//div[@class="MORADA"]/text()').extract()
        email = sel.xpath('//a[@class="sendMail"][1]/text()')[0].extract()
        url = sel.xpath('//div[@class="contentContacto sendUrl"]/a/text()').extract()
        telefone = sel.xpath('//div[@class="telefone"]/div[@class="contentContacto"]/text()').extract()
        fax = sel.xpath('//div[@class="fax"]/div[@class="contentContacto"]/text()').extract()
        descricao = sel.xpath('//div[@id="tbDescricao"]/p/text()').extract()
        gps = sel.xpath('//td[@class="sendGps"]/@style').extract()

        print titulo, email, morada

Best answer

You should not override the CrawlSpider parse method; if you do, the Rule will not be followed.

See the warning at http://doc.scrapy.org/en/latest/topics/spiders.html#crawling-rules

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
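To see why the override matters, here is a toy model in plain Python. It is not Scrapy's actual implementation: ToyCrawlSpider, BrokenSpider, WorkingSpider and next_page_links are all made-up names, and a page is just a dict with a list of links. It only mimics the idea that the base class applies the rules inside its own parse method.

```python
class ToyCrawlSpider:
    """Toy stand-in for a rule-driven spider; NOT Scrapy's real code."""

    rules = ()  # tuples of (link_picker_function, callback_name)

    def parse(self, page):
        # The base class's parse() is the place where rules are applied.
        followed = []
        for pick_links, callback_name in self.rules:
            for link in pick_links(page):
                followed.append((link, callback_name))
        return followed


def next_page_links(page):
    # Hypothetical rule: follow only pagination links.
    return [link for link in page["links"] if link.startswith("/page/")]


class BrokenSpider(ToyCrawlSpider):
    rules = ((next_page_links, "parse_listing"),)

    def parse(self, page):
        # Overriding parse() replaces the rule handling entirely.
        return [(link, "parse_links")
                for link in page["links"] if link.startswith("/item/")]


class WorkingSpider(ToyCrawlSpider):
    rules = ((next_page_links, "parse_listing"),)

    def parse_listing(self, page):
        # Custom logic lives under a different name; parse() is untouched.
        return [link for link in page["links"] if link.startswith("/item/")]


page = {"links": ["/item/1", "/page/2"]}
print(BrokenSpider().parse(page))   # [('/item/1', 'parse_links')]
print(WorkingSpider().parse(page))  # [('/page/2', 'parse_listing')]
```

In the broken variant the pagination link is never looked at, which mirrors the symptom in the question: only the start page and its item links get crawled.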

Regarding "python - How do I go to the next page in Scrapy rules", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21096172/
