
python - CrawlSpider fails to parse multiple pages in Scrapy

Reposted · Author: 行者123 · Updated: 2023-12-01 03:07:33

The CrawlSpider I created is not working properly. It parses the first page and then stops without continuing to the next page. I am doing something wrong but cannot detect what. I hope someone can give me a hint about what I should do to correct it.

My "items.py" contains:

from scrapy.item import Item, Field


class CraigslistScraperItem(Item):
    Name = Field()
    Link = Field()

The CrawlSpider, named "craigs.py", contains:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from craigslist_scraper.items import CraigslistScraperItem


class CraigsPySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo/',
    )
    rules = (
        Rule(LinkExtractor(allow=('sfbay\.craigslist\.org\/search\/npo/.*',),
                           restrict_xpaths=('//a[@class="button next"]')),
             callback='parse', follow=True),
    )

    def parse(self, response):
        page = response.xpath('//p[@class="result-info"]')
        items = []
        for title in page:
            item = CraigslistScraperItem()
            item["Name"] = title.xpath('.//a[@class="result-title hdrlnk"]/text()').extract()
            item["Link"] = title.xpath('.//a[@class="result-title hdrlnk"]/@href').extract()
            items.append(item)
        return items

Finally, the command I use to get CSV output is:

scrapy crawl craigs -o items.csv -t csv

By the way, I tried with "parse_item" first but got no response at all, which is why I switched to the "parse" method. Thanks in advance.

Best Answer

When using scrapy.CrawlSpider, do not name your callback method parse. From the Scrapy documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
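The pitfall described above can be illustrated with a minimal pure-Python sketch (hypothetical class names, no Scrapy required): a base class uses its own `parse` method to drive dispatch, so a subclass that overrides `parse` silently replaces the framework's logic.

```python
class BaseCrawler:
    """Stand-in for a framework class that relies on its own `parse`."""

    def parse(self, response):
        # Framework-internal logic: route the response to the user callback.
        return self.handle(response)

    def handle(self, response):
        raise NotImplementedError


class GoodSpider(BaseCrawler):
    # Correct: override the designated callback and leave `parse` alone.
    def handle(self, response):
        return f"handled {response}"


class BadSpider(BaseCrawler):
    # Wrong: overriding `parse` replaces the routing logic entirely, so
    # `handle` is never called -- analogous to naming a CrawlSpider rule
    # callback `parse`, which disables the rule-following machinery.
    def parse(self, response):
        return f"only first page: {response}"


print(GoodSpider().parse("page-1"))
print(BadSpider().parse("page-1"))
```

In the same way, a CrawlSpider whose rule callback is named `parse` still "works" on the first response, which is why the question's spider scraped one page and then stopped.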

Also, you don't need to append the items to a list: since you are already using Scrapy Items, you can simply yield each element. This code should work:

# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from craigslist_scraper.items import CraigslistScraperItem


class CraigsPySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["craigslist.org"]
    start_urls = (
        'http://sfbay.craigslist.org/search/npo/',
    )
    rules = (
        Rule(LinkExtractor(allow=('\/search\/npo\?s=.*',)), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        page = response.xpath('//p[@class="result-info"]')
        for title in page:
            item = CraigslistScraperItem()
            item["Name"] = title.xpath('.//a[@class="result-title hdrlnk"]/text()').extract_first()
            item["Link"] = title.xpath('.//a[@class="result-title hdrlnk"]/@href').extract_first()
            yield item
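The yield-based pattern used above can be contrasted with the original list-accumulation pattern in plain Python (hypothetical data, no Scrapy required): a generator emits each item as soon as it is built, and the consumer receives the same sequence without an intermediate list.

```python
def parse_with_list(titles):
    # Original pattern: accumulate items in a list and return it at the end.
    items = []
    for t in titles:
        items.append({"Name": t, "Link": f"/post/{t}"})
    return items


def parse_with_yield(titles):
    # Preferred pattern: yield each item as it is produced; the caller
    # (here Scrapy's engine) consumes the generator lazily.
    for t in titles:
        yield {"Name": t, "Link": f"/post/{t}"}


titles = ["volunteer", "coordinator"]
# Both patterns produce the same items; the generator just avoids
# buffering them all in memory first.
assert parse_with_list(titles) == list(parse_with_yield(titles))
```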

Finally, to get output in CSV format, run: scrapy crawl craigs -o items.csv

Regarding "python - CrawlSpider fails to parse multiple pages in Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43208931/
