
python - Scrapy Spider not following links

Reposted. Author: 行者123 · Updated: 2023-11-28 16:32:00

I am writing a scrapy spider to scrape today's NYT articles from the homepage, but for some reason it doesn't follow any links. When I instantiate the link extractor in scrapy shell http://www.nytimes.com, it successfully extracts a list of article urls with le.extract_links(response), but I can't get my crawl command (scrapy crawl nyt -o out.json) to scrape anything but the homepage. I'm sort of at my wit's end. Is it because the homepage does not yield an article from the parse function? Any help is greatly appreciated.

from datetime import date

import scrapy
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors import LinkExtractor


from ..items import NewsArticle

with open('urls/debug/nyt.txt') as debug_urls:
    debug_urls = debug_urls.readlines()

with open('urls/release/nyt.txt') as release_urls:
    release_urls = release_urls.readlines()  # ["http://www.nytimes.com"]

today = date.today().strftime('%Y/%m/%d')
print today


class NytSpider(scrapy.Spider):
    name = "nyt"
    allowed_domains = ["nytimes.com"]
    start_urls = release_urls
    rules = (
        Rule(LinkExtractor(allow=(r'/%s/[a-z]+/.*\.html' % today, )),
             callback='parse', follow=True),
    )

    def parse(self, response):
        article = NewsArticle()
        for story in response.xpath('//article[@id="story"]'):
            article['url'] = response.url
            article['title'] = story.xpath(
                '//h1[@id="story-heading"]/text()').extract()
            article['author'] = story.xpath(
                '//span[@class="byline-author"]/@data-byline-name'
            ).extract()
            article['published'] = story.xpath(
                '//time[@class="dateline"]/@datetime').extract()
            article['content'] = story.xpath(
                '//div[@id="story-body"]/p//text()').extract()
            yield article
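As an aside, note that the rule's allow pattern bakes today's date into the regex at module import time, so only links whose path contains today's date can ever match. A quick standalone check of that pattern (using a fixed date instead of date.today() so the example is reproducible):

```python
import re
from datetime import date

# Rebuild the spider's allow pattern, with a fixed date substituted
# for date.today() so the example always behaves the same way.
today = date(2015, 6, 18).strftime('%Y/%m/%d')
pattern = r'/%s/[a-z]+/.*\.html' % today

# A typical NYT article URL embeds the publication date and section,
# so it matches:
assert re.search(pattern, 'http://www.nytimes.com/2015/06/18/science/some-article.html')

# Index pages and links without today's date in the path never match,
# so the extractor would not follow them:
assert not re.search(pattern, 'http://www.nytimes.com/pages/todayspaper/index.html')
```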

Best Answer

I found the solution to my problem. I was doing two things wrong:

  1. I needed to subclass CrawlSpider rather than Spider if I wanted it to automatically crawl sublinks.
  2. When using CrawlSpider, I needed to point the rule at a callback function rather than overriding parse. As per the docs, overriding parse breaks CrawlSpider functionality.

Regarding "python - Scrapy Spider not following links", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/30922218/
