
python - Scrapy LinkExtractor - Limiting the number of pages crawled per URL


I am trying to limit the number of pages crawled per URL in a Scrapy CrawlSpider. I have a list of start_urls, and I want to set a limit on how many pages are crawled from each of them. Once the limit is reached, the spider should move on to the next start_url.

I know there is a DEPTH_LIMIT parameter in the settings, but that is not what I am looking for.
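For reference, a minimal sketch of that setting (the value is illustrative): it caps how many link hops the crawl follows from each start URL, not how many pages get fetched per start_url.

# settings.py (illustrative only)
DEPTH_LIMIT = 2  # stop following links more than 2 hops away from the start URLs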

Any help is appreciated.

Here is the code I have so far:

class MySpider(CrawlSpider):
    name = 'test'
    allowed_domains = domainvarwebsite
    start_urls = httpvarwebsite

    rules = [Rule(LinkExtractor(),
                  callback='parse_item',
                  follow=True)
             ]

    def parse_item(self, response):
        # here I parse and yield the items I am interested in

EDIT

I tried to implement this, but I get exceptions.SyntaxError: invalid syntax (filter_domain.py, line 20). Any idea what is going on?

Thanks again.

filter_domain.py

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest

class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        self.domains_to_filter = domains_to_filter
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        spider_name = crawler.spider.name
        max_to_filter = settings.get('MAX_TO_FILTER')
        o = cls(max_to_filter)
        return o

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        # line 20:
        if self.counter.get(parsed_url.netloc, 0) < self.max_to_filter[parsed_url.netloc]):
            self.counter[parsed_url.netloc] += 1
        else:
            raise IgnoreRequest()

settings.py

MAX_TO_FILTER = 30

DOWNLOADER_MIDDLEWARES = {
    'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
}

Best Answer

Scrapy doesn't offer this directly, but you can create a custom middleware, something like this:

import urlparse
from collections import defaultdict
from scrapy.exceptions import IgnoreRequest

class FilterDomainbyLimitMiddleware(object):
    def __init__(self, domains_to_filter):
        self.domains_to_filter = domains_to_filter
        self.counter = defaultdict(int)

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        spider_name = crawler.spider.name
        domains_to_filter = settings.get('DOMAINS_TO_FILTER')
        o = cls(domains_to_filter)
        return o

    def process_request(self, request, spider):
        parsed_url = urlparse.urlparse(request.url)
        if parsed_url.netloc in self.domains_to_filter:
            # count requests per domain; once the limit is reached, drop further requests
            if self.counter.get(parsed_url.netloc, 0) < self.domains_to_filter[parsed_url.netloc]:
                self.counter[parsed_url.netloc] += 1
            else:
                raise IgnoreRequest()

and declare DOMAINS_TO_FILTER in the settings like this:

DOMAINS_TO_FILTER = {
    'mydomain': 5
}

to accept only 5 requests from that domain. Also remember to enable the middleware in your settings, as shown in the sketch below.
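A minimal sketch of what enabling it could look like, assuming the middleware lives in myproject/filter_domain.py as in the question above (the module path and the priority value 400 are placeholders to adjust for your project):

# settings.py (illustrative sketch; module path and priority are placeholders)
DOWNLOADER_MIDDLEWARES = {
    'myproject.filter_domain.FilterDomainbyLimitMiddleware': 400,
}

DOMAINS_TO_FILTER = {
    'mydomain': 5,  # allow at most 5 requests whose netloc is 'mydomain'
}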

Regarding python - Scrapy LinkExtractor - limiting the number of pages crawled per URL, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34452788/
