python - Scrapy - Understanding CrawlSpider and LinkExtractor

Reposted by 太空狗, updated 2023-10-30 02:02:06

I'm trying to use CrawlSpider and understand the following example from the Scrapy docs:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # rules and parse_item must be indented inside the class body.
    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        # Note: a bare scrapy.Item() declares no fields, so the assignments below
        # raise KeyError as written; use an Item subclass with declared fields
        # (or a plain dict) in real code.
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

The description given is:

This spider would start crawling example.com’s home page, collecting category links, and item links, parsing the latter with the parse_item method. For each item response, some data will be extracted from the HTML using XPath, and an Item will be filled with it.

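As I understand it, the second rule extracts links matching item.php and then uses the parse_item method to extract information from those pages. But what exactly is the purpose of the first rule? The description only says that it "collects" the links. What does that mean, and why is it useful if no data is extracted from those pages?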

Best answer

CrawlSpider is very useful when crawling forums in search of posts, for example, or categorized online stores in search of product pages.

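The idea is that you "somehow" have to go into each category and look for the links that correspond to the product/item information you want to extract. Those product links are the ones specified by the second rule of the example (the links that have item.php in the URL).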
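To make the link selection concrete, here is a minimal sketch that approximates the allow/deny filtering of the two rules with plain `re.search` calls. It does not use Scrapy itself; the helper `url_matches` and the sample URLs are hypothetical and exist only for illustration, on the assumption that a link passes when it matches at least one allow pattern and no deny pattern.

```python
import re

# Patterns copied from the two rules in the example spider.
CATEGORY_ALLOW = (r'category\.php',)
CATEGORY_DENY = (r'subsection\.php',)
ITEM_ALLOW = (r'item\.php',)


def url_matches(url, allow=(), deny=()):
    """Hypothetical helper: a URL passes if it matches at least one allow
    pattern (or allow is empty) and matches no deny pattern."""
    allowed = not allow or any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


# Sample URLs invented for this illustration.
urls = [
    'http://www.example.com/category.php?id=3',
    'http://www.example.com/category.php/subsection.php?id=7',
    'http://www.example.com/item.php?id=42',
    'http://www.example.com/about.html',
]

# Links the first rule would pick up (followed, never parsed).
followed_only = [u for u in urls if url_matches(u, CATEGORY_ALLOW, CATEGORY_DENY)]
# Links the second rule would pick up (handed to parse_item).
parsed_as_items = [u for u in urls if url_matches(u, ITEM_ALLOW)]
```

In this sketch only the first URL ends up in `followed_only` and only the third in `parsed_as_items`; the about page matches neither rule and would be ignored.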

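Now, how should the spider keep visiting links until it finds the ones containing item.php? That is the job of the first rule. It says to visit every link containing category.php but not subsection.php. The spider does not extract any "item" from those pages, but the rule defines the path the spider takes to reach the real items.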
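The path-finding role of the first rule can be sketched as a toy crawl over an in-memory site. This is a simplified simulation, not Scrapy's actual scheduler: the `SITE` dictionary, the `RULES` tuples, and the `crawl` function are all hypothetical, and the sketch assumes that the first matching rule wins and that pages handled by a callback are not followed further.

```python
import re
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    '/': ['/category.php?c=1', '/category.php?c=2', '/about.html'],
    '/category.php?c=1': ['/item.php?id=10', '/category.php/subsection.php?c=1'],
    '/category.php?c=2': ['/item.php?id=20', '/item.php?id=10'],
    '/category.php/subsection.php?c=1': ['/item.php?id=99'],
    '/item.php?id=10': [],
    '/item.php?id=20': [],
    '/item.php?id=99': [],
    '/about.html': [],
}

# (allow, deny, has_callback), in the same order as the example's rules.
RULES = [
    (r'category\.php', r'subsection\.php', False),
    (r'item\.php', None, True),
]


def crawl(start='/'):
    queue = deque([(start, False)])  # (url, should the item callback run on it?)
    seen = {start}
    fetched, items = [], []
    while queue:
        url, has_callback = queue.popleft()
        fetched.append(url)
        if has_callback:
            items.append(url)  # stands in for parse_item
            continue           # callback pages are not followed in this sketch
        for link in SITE[url]:
            if link in seen:
                continue       # skip URLs that were already scheduled
            for allow, deny, callback in RULES:
                if re.search(allow, link) and not (deny and re.search(deny, link)):
                    seen.add(link)
                    queue.append((link, callback))
                    break      # the first matching rule handles the link


    return fetched, items


fetched, items = crawl()
```

Here the two category pages are fetched only so that their item links can be discovered. The subsection page is denied by the first rule, so `/item.php?id=99`, which is reachable only through it, is never visited.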

That is why the rule has no callback: the response for such a link is not handed back to you for processing, because its links are simply followed.
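The relationship between callback and follow can be sketched as a small function. This is not Scrapy code; `effective_follow` is a hypothetical helper that mirrors the default quoted in the example's own comment (no callback means follow=True), under the assumption that a rule with a callback does not follow unless follow is set explicitly.

```python
def effective_follow(callback=None, follow=None):
    """Hypothetical helper: decide whether links on a matched page are followed.

    If follow is given explicitly it wins; otherwise links are followed
    only when the rule has no callback.
    """
    if follow is not None:
        return follow
    return callback is None


# First rule of the example: no callback, so its pages are followed.
category_rule_follows = effective_follow()
# Second rule: has a callback, so by default its pages are parsed but not followed.
item_rule_follows = effective_follow(callback='parse_item')
# A rule can do both if follow is set explicitly alongside the callback.
parse_and_follow = effective_follow(callback='parse_item', follow=True)
```

So if item pages also linked to further pages worth crawling, the second rule would need `follow=True` in addition to its callback.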

A similar question about python - Scrapy - Understanding CrawlSpider and LinkExtractor can be found on Stack Overflow: https://stackoverflow.com/questions/44527996/


