
python - Can't get Scrapy to follow links

Reposted · Author: 行者123 · Updated: 2023-11-28 20:27:55

I'm trying to scrape a website, but I can't get the spider to follow links. I get no Python errors, and Wireshark shows no traffic. I thought the problem might be the regular expression, but I tried ".*" to match any link and that didn't work either. The parse method does run, but what I need is to follow the "sinopsis.aspx" links with parse_peliculas as the callback.
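A quick check confirms the regex itself is not the problem: the pattern used in the rule does match sinopsis links. (The URL below is a hypothetical example of such a link, not one taken from the site.)

```python
import re

# Pattern from the spider's rule; link extractors apply it with re.search.
pattern = re.compile(r'sinopsis.aspx.*')

# Hypothetical example of a "sinopsis.aspx" link on the site.
url = 'http://www.cinemark.com.mx/smartphone/iphone/sinopsis.aspx?id=123'

print(bool(pattern.search(url)))  # True -- the pattern matches such URLs
```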

Edit: commenting out the parse method makes the rules work... parse_peliculas starts running. What I'm trying now is to rename the parse method to something else and add a rule with a callback to it, but I still can't get that to work.

Here is my spider code:

import re

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from Cinesillo.items import CinemarkItem, PeliculasItem


class CinemarkSpider(CrawlSpider):
    name = 'cinemark'
    allowed_domains = ['cinemark.com.mx']
    start_urls = ['http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=555',
                  'http://www.cinemark.com.mx/smartphone/iphone/vercartelera.aspx?fecha=&id_theater=528']

    rules = (Rule(SgmlLinkExtractor(allow=(r'sinopsis.aspx.*', )),
                  callback='parse_peliculas', follow=True),)

    def parse(self, response):
        item = CinemarkItem()
        hxs = HtmlXPathSelector(response)
        cine = hxs.select('(//td[@class="title2"])[1]')
        direccion = hxs.select('(//td[@class="title2"])[2]')

        item['nombre'] = cine.select('text()').extract()
        item['direccion'] = direccion.select('text()').extract()
        return item

    def parse_peliculas(self, response):
        item = PeliculasItem()
        hxs = HtmlXPathSelector(response)
        titulo = hxs.select('//td[@class="pop_up_title"]')
        item['titulo'] = titulo.select('text()').extract()
        return item

Thanks

Best Answer

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html
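The mechanism can be illustrated without Scrapy: CrawlSpider routes every downloaded response through its own parse method, which applies the rules and dispatches to your callbacks, so overriding parse short-circuits that dispatch. A minimal stdlib sketch of the idea (class and method names here are illustrative, not Scrapy's actual internals):

```python
class CrawlSpiderLike:
    """Base class whose parse() applies the rules and dispatches callbacks,
    as CrawlSpider's own parse method does (greatly simplified)."""
    rules = ()

    def parse(self, response):
        results = []
        for callback_name in self.rules:
            results.extend(getattr(self, callback_name)(response))
        return results


class BrokenSpider(CrawlSpiderLike):
    rules = ('parse_peliculas',)

    # Overriding parse replaces the dispatcher, so the rules never run.
    def parse(self, response):
        return ['item from parse']

    def parse_peliculas(self, response):
        return ['pelicula item']


class FixedSpider(CrawlSpiderLike):
    rules = ('parse_peliculas',)

    # Renamed callback (hypothetical name): the base parse keeps dispatching.
    def parse_cines(self, response):
        return ['cine item']

    def parse_peliculas(self, response):
        return ['pelicula item']


print(BrokenSpider().parse(None))  # ['item from parse'] -- rules ignored
print(FixedSpider().parse(None))   # ['pelicula item']   -- rules fire
```

In the actual spider, renaming parse to any other name removes the conflict; if the start pages themselves need scraping, CrawlSpider's parse_start_url hook can be overridden for that purpose instead.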

On "python - Can't get Scrapy to follow links", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/7045883/
