
python - Initializing a CrawlSpider in Scrapy


I have written a spider in Scrapy which is basically doing fine and does exactly what it is supposed to do.
The problem is that I need to make some small changes to it, and I have tried several approaches without success (e.g. modifying the InitSpider). Here is what the script is supposed to do now:

  • Crawl the starting url http://www.example.de/index/search?method=simple
  • Now proceed to the url http://www.example.de/index/search?filter=homepage
  • Start crawling from here using the patterns defined in the rules

So basically all that needs to change is calling one URL in between. I would rather not rewrite the whole thing with a BaseSpider, so I hope somebody has an idea on how to achieve this :)

If you need any additional information, please let me know. You can find the current script below.
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from example.items import ExampleItem
    from scrapy.contrib.loader.processor import TakeFirst
    import re
    import urllib

    take_first = TakeFirst()

    class ExampleSpider(CrawlSpider):
        name = "example"
        allowed_domains = ["example.de"]

        start_url = "http://www.example.de/index/search?method=simple"
        start_urls = [start_url]

        rules = (
            # http://www.example.de/index/search?page=2
            # http://www.example.de/index/search?page=1&tab=direct
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*&tab=direct', )), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)

            # fetch all company entries
            companies = hxs.select("//ul[contains(@class, 'directresults')]/li[contains(@id, 'entry')]")
            items = []

            for company in companies:
                item = ExampleItem()
                item['name'] = take_first(company.select(".//span[@class='fn']/text()").extract())
                item['address'] = company.select(".//p[@class='data track']/text()").extract()
                item['website'] = take_first(company.select(".//p[@class='customurl track']/a/@href").extract())

                # we try to fetch the number directly from the page (only works for premium entries)
                item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/text()").extract())

                if not item['telephone']:
                    # if we cannot fetch the number it has been encoded on the client and hidden in the rel=""
                    item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/@rel").extract())

                items.append(item)
            return items
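
As an editorial aside (not part of the original question): the allow= patterns in the rules above are regular expressions that the link extractor matches against the absolute URLs it finds, so they can be sanity-checked outside the spider with plain re.search. A minimal sketch, reusing the two example URLs from the comments above the rules:

    import re

    # the two allow= patterns from the rules above
    patterns = [r'\/index\/search\?page=\d*$', r'\/index\/search\?page=\d*&tab=direct']

    # the example URLs from the comments above the rules
    urls = [
        "http://www.example.de/index/search?page=2",
        "http://www.example.de/index/search?page=1&tab=direct",
    ]

    for url in urls:
        matching = [p for p in patterns if re.search(p, url)]
        print("%s -> %s" % (url, matching))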

EDIT

Here is my attempt with the InitSpider: https://gist.github.com/150b30eaa97e0518673a
I got the idea from here: Crawling with an authenticated session in Scrapy

As you can see, it still inherits from CrawlSpider, but I made some changes to the core Scrapy files (not my favourite approach). I let the CrawlSpider inherit from InitSpider instead of BaseSpider (source).

This works so far, but the spider just stops after the first page instead of picking up all the other ones.

Also, this approach seems completely unnecessary to me :)
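
For readers who cannot open the gist: the usual InitSpider pattern referenced above (from the linked answer about authenticated sessions) looks roughly like the sketch below. This is an editorial reconstruction with placeholder URLs from the question, not the actual gist, and it assumes the same Scrapy 0.x module paths used elsewhere in this post. init_request() runs before the normal crawl, and returning self.initialized() hands control back to the spider's regular start requests.

    from scrapy.contrib.spiders.init import InitSpider
    from scrapy.http import Request


    class InitExampleSpider(InitSpider):
        name = "example_init"
        allowed_domains = ["example.de"]
        start_urls = ["http://www.example.de/index/search?filter=homepage"]

        def init_request(self):
            # visited before the normal crawl starts
            return Request("http://www.example.de/index/search?method=simple",
                           callback=self.after_simple_search)

        def after_simple_search(self, response):
            # resume the normal crawl (start_urls / parse) once the
            # intermediate page has been requested
            return self.initialized()

        def parse(self, response):
            # item extraction would go here, as in parse_item() above
            pass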

Best Answer

OK, I found the solution myself, and it is actually much simpler than I initially thought :)

Here is the simplified script:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy import log
    from scrapy.selector import HtmlXPathSelector
    from example.items import ExampleItem
    from scrapy.contrib.loader.processor import TakeFirst
    import re
    import urllib

    take_first = TakeFirst()

    class ExampleSpider(BaseSpider):
        name = "ExampleNew"
        allowed_domains = ["www.example.de"]

        start_page = "http://www.example.de/index/search?method=simple"
        direct_page = "http://www.example.de/index/search?page=1&tab=direct"
        filter_page = "http://www.example.de/index/search?filter=homepage"

        def start_requests(self):
            """This function is called before crawling starts."""
            return [Request(url=self.start_page, callback=self.request_direct_tab)]

        def request_direct_tab(self, response):
            return [Request(url=self.direct_page, callback=self.request_filter)]

        def request_filter(self, response):
            return [Request(url=self.filter_page, callback=self.parse_item)]

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)

            # fetch the items you need and yield them like this:
            # yield item

            # fetch the next pages to scrape
            for url in hxs.select("//div[@class='limiter']/a/@href").extract():
                absolute_url = "http://www.example.de" + url
                yield Request(absolute_url, callback=self.parse_item)

As you can see, I am now using a BaseSpider and simply yielding the new requests myself at the end. At the beginning, I just chain together all the different requests that need to be made before the crawl can start.
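
An editorial addendum (not part of the original answer): since the question explicitly wanted to avoid rewriting everything with a BaseSpider, the same chaining can usually be kept inside the original CrawlSpider by only overriding start_requests() and handing the final response to CrawlSpider's built-in parse(), which is the entry point that applies the rules. A minimal sketch under that assumption, using the Scrapy 0.x APIs from the question; whether parse() may be used as a callback like this depends on the Scrapy version, so treat it as a sketch rather than a drop-in replacement.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import Request


    class ChainedCrawlSpider(CrawlSpider):
        name = "example_chained"
        allowed_domains = ["example.de"]

        start_page = "http://www.example.de/index/search?method=simple"
        filter_page = "http://www.example.de/index/search?filter=homepage"

        rules = (
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )),
                 callback='parse_item', follow=True),
        )

        def start_requests(self):
            # visit the plain search page first instead of relying on start_urls
            return [Request(self.start_page, callback=self.request_filter_page)]

        def request_filter_page(self, response):
            # handing the filter page to CrawlSpider's own parse() lets the
            # Rule-based link extraction take over from there
            return [Request(self.filter_page, callback=self.parse)]

        def parse_item(self, response):
            # item extraction as in the original parse_item() above
            pass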

I hope this helps somebody :) If you have questions, I will be happy to answer them.
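
One last editorial aside: instead of hard-coding the "http://www.example.de" prefix when building absolute_url, the extracted href can be resolved against the URL of the page that was just crawled with urljoin from the standard library (urlparse on Python 2, urllib.parse on Python 3). A tiny standalone sketch; inside parse_item() the equivalent call would be urljoin(response.url, url):

    try:
        from urlparse import urljoin       # Python 2
    except ImportError:
        from urllib.parse import urljoin   # Python 3

    base = "http://www.example.de/index/search?filter=homepage"
    print(urljoin(base, "/index/search?page=2"))
    # -> http://www.example.de/index/search?page=2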

Regarding python - Initializing a CrawlSpider in Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12191631/
