
python - Initializing a CrawlSpider in Scrapy


I have written a spider in Scrapy which is basically doing fine and does exactly what it is supposed to do.
The problem is that I need to make some small changes to it, and I have tried several approaches without success (e.g. modifying the InitSpider). Here is what the script is supposed to do now:

  • Crawl the starting url http://www.example.de/index/search?method=simple
  • Now proceed to the url http://www.example.de/index/search?filter=homepage
  • Start crawling from here using the patterns defined in the rules

So basically all that needs to change is calling one URL in between. I would rather not rewrite the whole thing with a BaseSpider, so I hope somebody has an idea on how to achieve this :)

If you need any additional information, please let me know. You can find the current script below.
    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.http import Request
    from example.items import ExampleItem
    from scrapy.contrib.loader.processor import TakeFirst
    import re
    import urllib

    take_first = TakeFirst()

    class ExampleSpider(CrawlSpider):
        name = "example"
        allowed_domains = ["example.de"]

        start_url = "http://www.example.de/index/search?method=simple"
        start_urls = [start_url]

        rules = (
            # http://www.example.de/index/search?page=2
            # http://www.example.de/index/search?page=1&tab=direct
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )), callback='parse_item', follow=True),
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*&tab=direct', )), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)

            # fetch all company entries
            companies = hxs.select("//ul[contains(@class, 'directresults')]/li[contains(@id, 'entry')]")
            items = []

            for company in companies:
                item = ExampleItem()
                item['name'] = take_first(company.select(".//span[@class='fn']/text()").extract())
                item['address'] = company.select(".//p[@class='data track']/text()").extract()
                item['website'] = take_first(company.select(".//p[@class='customurl track']/a/@href").extract())

                # we try to fetch the number directly from the page (only works for premium entries)
                item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/text()").extract())

                if not item['telephone']:
                    # if we cannot fetch the number it has been encoded on the client and hidden in the rel=""
                    item['telephone'] = take_first(company.select(".//p[@class='numericdata track']/a/@rel").extract())

                items.append(item)
            return items
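
As an editorial aside (not part of the original question): the allow= patterns in the rules above are regular expressions that the link extractor matches against the absolute URLs it finds, so they can be sanity-checked outside the spider with plain re.search. A minimal sketch, reusing the two example URLs from the comments above the rules:

    import re

    # the two allow= patterns from the rules above
    patterns = [r'\/index\/search\?page=\d*$', r'\/index\/search\?page=\d*&tab=direct']

    # the example URLs from the comments above the rules
    urls = [
        "http://www.example.de/index/search?page=2",
        "http://www.example.de/index/search?page=1&tab=direct",
    ]

    for url in urls:
        matching = [p for p in patterns if re.search(p, url)]
        print("%s -> %s" % (url, matching))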

EDIT

Here is my attempt with the InitSpider: https://gist.github.com/150b30eaa97e0518673a
I got the idea from here: Crawling with an authenticated session in Scrapy

As you can see, it still inherits from CrawlSpider, but I made some changes to the core Scrapy files (not my favourite approach). I let the CrawlSpider inherit from InitSpider instead of BaseSpider (source).

This works so far, but the spider just stops after the first page instead of picking up all the other ones.

Also, this approach seems completely unnecessary to me :)
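
For readers who cannot open the gist: the usual InitSpider pattern referenced above (from the linked answer about authenticated sessions) looks roughly like the sketch below. This is an editorial reconstruction with placeholder URLs from the question, not the actual gist, and it assumes the same Scrapy 0.x module paths used elsewhere in this post. init_request() runs before the normal crawl, and returning self.initialized() hands control back to the spider's regular start requests.

    from scrapy.contrib.spiders.init import InitSpider
    from scrapy.http import Request


    class InitExampleSpider(InitSpider):
        name = "example_init"
        allowed_domains = ["example.de"]
        start_urls = ["http://www.example.de/index/search?filter=homepage"]

        def init_request(self):
            # visited before the normal crawl starts
            return Request("http://www.example.de/index/search?method=simple",
                           callback=self.after_simple_search)

        def after_simple_search(self, response):
            # resume the normal crawl (start_urls / parse) once the
            # intermediate page has been requested
            return self.initialized()

        def parse(self, response):
            # item extraction would go here, as in parse_item() above
            pass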

Best Answer

OK, I found the solution myself, and it is actually much simpler than I initially thought :)

Here is the simplified script:

    #!/usr/bin/python
    # -*- coding: utf-8 -*-

    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy import log
    from scrapy.selector import HtmlXPathSelector
    from example.items import ExampleItem
    from scrapy.contrib.loader.processor import TakeFirst
    import re
    import urllib

    take_first = TakeFirst()

    class ExampleSpider(BaseSpider):
        name = "ExampleNew"
        allowed_domains = ["www.example.de"]

        start_page = "http://www.example.de/index/search?method=simple"
        direct_page = "http://www.example.de/index/search?page=1&tab=direct"
        filter_page = "http://www.example.de/index/search?filter=homepage"

        def start_requests(self):
            """This function is called before crawling starts."""
            return [Request(url=self.start_page, callback=self.request_direct_tab)]

        def request_direct_tab(self, response):
            return [Request(url=self.direct_page, callback=self.request_filter)]

        def request_filter(self, response):
            return [Request(url=self.filter_page, callback=self.parse_item)]

        def parse_item(self, response):
            hxs = HtmlXPathSelector(response)

            # fetch the items you need and yield them like this:
            # yield item

            # fetch the next pages to scrape
            for url in hxs.select("//div[@class='limiter']/a/@href").extract():
                absolute_url = "http://www.example.de" + url
                yield Request(absolute_url, callback=self.parse_item)

As you can see, I am now using a BaseSpider and simply yielding the new requests myself at the end. At the beginning, I just chain together all the different requests that need to be made before the crawl can start.
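
An editorial addendum (not part of the original answer): since the question explicitly wanted to avoid rewriting everything with a BaseSpider, the same chaining can usually be kept inside the original CrawlSpider by only overriding start_requests() and handing the final response to CrawlSpider's built-in parse(), which is the entry point that applies the rules. A minimal sketch under that assumption, using the Scrapy 0.x APIs from the question; whether parse() may be used as a callback like this depends on the Scrapy version, so treat it as a sketch rather than a drop-in replacement.

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.http import Request


    class ChainedCrawlSpider(CrawlSpider):
        name = "example_chained"
        allowed_domains = ["example.de"]

        start_page = "http://www.example.de/index/search?method=simple"
        filter_page = "http://www.example.de/index/search?filter=homepage"

        rules = (
            Rule(SgmlLinkExtractor(allow=('\/index\/search\?page=\d*$', )),
                 callback='parse_item', follow=True),
        )

        def start_requests(self):
            # visit the plain search page first instead of relying on start_urls
            return [Request(self.start_page, callback=self.request_filter_page)]

        def request_filter_page(self, response):
            # handing the filter page to CrawlSpider's own parse() lets the
            # Rule-based link extraction take over from there
            return [Request(self.filter_page, callback=self.parse)]

        def parse_item(self, response):
            # item extraction as in the original parse_item() above
            pass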

I hope this helps somebody :) If you have questions, I will be happy to answer them.
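
One last editorial aside: instead of hard-coding the "http://www.example.de" prefix when building absolute_url, the extracted href can be resolved against the URL of the page that was just crawled with urljoin from the standard library (urlparse on Python 2, urllib.parse on Python 3). A tiny standalone sketch; inside parse_item() the equivalent call would be urljoin(response.url, url):

    try:
        from urlparse import urljoin       # Python 2
    except ImportError:
        from urllib.parse import urljoin   # Python 3

    base = "http://www.example.de/index/search?filter=homepage"
    print(urljoin(base, "/index/search?page=2"))
    # -> http://www.example.de/index/search?page=2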

Regarding python - Initializing a CrawlSpider in Scrapy, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12191631/
