python - Unable to follow links with Scrapy


I created a spider that extends CrawlSpider and followed the advice at http://scrapy.readthedocs.org/en/latest/topics/spiders.html

The problem is that I need to parse both the start URL (which happens to coincide with the hostname) and some of the links it contains.

So I defined a rule such as rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True)], but nothing happened.

Then I tried defining a pair of rules, e.g. rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_items', follow=True), Rule(SgmlLinkExtractor(allow=['/']), callback='parse_items', follow=True)]. Now the problem is that the spider parses everything.
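(Aside, not from the original post: one way to see which links a pattern actually matches is to run the extractor by hand in the Scrapy shell. The SgmlLinkExtractor API below is the old Scrapy 0.x one used throughout this question.)

    # run inside: scrapy shell http://techcrunch.com
    # the shell binds `response` to the fetched page
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    le = SgmlLinkExtractor(allow=['/page/\d+'])
    for link in le.extract_links(response):
        print link.url  # only pagination links should appear here

Note that allow patterns are unanchored regular expressions searched against the full URL, so a pattern like '/' matches every internal link, which is why the second rule above makes the spider parse everything.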

How do I tell the spider to parse the start_url as well as some of the links it contains?

Update:

I tried overriding the parse_start_url method, so now I can get data from the start page, but it still doesn't follow the links defined with the Rule:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from techCrunch.items import Article


class ExampleSpider(CrawlSpider):
    name = 'TechCrunchCrawler'
    start_urls = ['http://techcrunch.com']
    allowed_domains = ['techcrunch.com']
    rules = [Rule(SgmlLinkExtractor(allow=['/page/\d+']), callback='parse_links', follow=True)]

    # called by CrawlSpider for each response from start_urls
    def parse_start_url(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        return self.parse_links(response)

    def parse_links(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        articles = []
        for i in HtmlXPathSelector(response).select('//h2[@class="headline"]/a'):
            article = Article()
            article['title'] = i.select('./@title').extract()
            article['link'] = i.select('./@href').extract()
            articles.append(article)
        return articles

Best Answer

I ran into a similar problem in the past, and I ended up sticking with BaseSpider.

Try this:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapy.contrib.loader import XPathItemLoader

from techCrunch.items import Article


class techCrunch(BaseSpider):
    name = 'techCrunchCrawler'
    allowed_domains = ['techcrunch.com']

    # This fetches your start page and hands it to the parse manager
    def start_requests(self):
        return [Request("http://techcrunch.com", callback=self.parseMgr)]

    # The parse manager deals out what to parse and handles start-page extraction
    def parseMgr(self, response):
        print '++++++++++++++++++++++++parse start url++++++++++++++++++++++++'
        yield self.pageParser(response)

        nextPage = HtmlXPathSelector(response).select("//div[@class='page-next']/a/@href").extract()
        if nextPage:
            yield Request(nextPage[0], callback=self.parseMgr)

    # The page parser only parses pages and returns an item per page call
    def pageParser(self, response):
        print '++++++++++++++++++++++++parse link called++++++++++++++++++++++++'
        loader = XPathItemLoader(item=Article(), response=response)
        loader.add_xpath('title', '//h2[@class="headline"]/a/@title')
        loader.add_xpath('link', '//h2[@class="headline"]/a/@href')
        return loader.load_item()
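Neither snippet shows the Article item that both spiders import from techCrunch.items; here is a minimal sketch, assuming it only declares the two fields populated above (the real file may define more):

    # techCrunch/items.py -- minimal sketch, fields inferred from the loaders above
    from scrapy.item import Item, Field

    class Article(Item):
        title = Field()
        link = Field()

With that in place, the spider runs as usual with scrapy crawl techCrunchCrawler.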

Regarding python - Unable to follow links with Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/11356196/
