
python - Scrapy project middleware - TypeError: process_start_requests() takes 2 positional arguments but 3 were given


As soon as I uncomment the spider middleware in my settings, I get the error:

SPIDER_MIDDLEWARES = {
    'scrapyspider.middlewares.ScrapySpiderProjectMiddleware': 543,
}

Here is my spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class DomainLinks(Item):
    links = Field()

class ScrapyProject(CrawlSpider):
    name = 'scrapyspider'

    #allowed_domains = []
    start_urls = ['http://www.example.com']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)

    def parse_start_url(self, response):
        self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []

        links = LxmlLinkExtractor(allow=(), deny=()).extract_links(response)

        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)

        return item

Here is an excerpt from the project's middleware file. process_spider_output is where I filter out external links; it is the call to process_start_requests that triggers the error.

def process_spider_output(response, result, spider):
    # Called with the results returned from the Spider, after
    # it has processed the response.

    domain = response.url.strip("http://","").strip("https://","").strip("www.").strip("ww2.").split("/")[0]

    filtered_result = []
    for i in result:
        if domain in i:
            filtered_result.append(i)

    # Must return an iterable of Request, dict or Item objects.
    for i in filtered_result:
        yield i

def process_start_requests(start_requests, spider):
    # Called with the start requests of the spider, and works
    # similarly to the process_spider_output() method, except
    # that it doesn't have a response associated.

    # Must return only requests (not items).
    for r in start_requests:
        yield r

Traceback:

2017-05-01 12:30:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapyproject.middlewares.scrapyprojectSpiderMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-05-01 12:30:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-05-01 12:30:55 [scrapy.core.engine] INFO: Spider opened
Unhandled error in Deferred:
2017-05-01 12:30:55 [twisted] CRITICAL: Unhandled error in Deferred:

2017-05-01 12:30:55 [twisted] CRITICAL:
Traceback (most recent call last):
  File "/home/matt/.local/lib/python3.5/site-packages/twisted/internet/defer.py", line 1301, in _inlineCallbacks
    result = g.send(result)
  File "/home/matt/.local/lib/python3.5/site-packages/scrapy/crawler.py", line 74, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
TypeError: process_start_requests() takes 2 positional arguments but 3 were given

I am trying to filter the links so that only internal links are followed and extracted.
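
By "internal" I mean links whose host matches the page they were found on. A rough sketch of the check I have in mind (urlparse is just one way to do it, and is_internal is only an illustrative name, not a helper that exists in my project):

from urllib.parse import urlparse

def is_internal(link_url, page_url):
    # True when the link points at the same host as the page it was found on.
    return urlparse(link_url).netloc == urlparse(page_url).netloc

# e.g. is_internal('http://www.example.com/about', 'http://www.example.com/') -> True
#      is_internal('http://other.example.org/', 'http://www.example.com/')    -> False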

The Scrapy documentation is not very clear on this.

Thanks

Best Answer

Since every Scrapy middleware I have seen lives inside a class, I suspect the self parameter is missing:

def process_spider_output(self, response, result, spider):
    # ...

def process_start_requests(self, start_requests, spider):
    # ...

Hope this helps. If not, please post the full middleware file.
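
For completeness, a minimal sketch of how the two methods might sit inside the middleware class (the class name is taken from your SPIDER_MIDDLEWARES setting above; the same-domain filter in process_spider_output is only an illustration, not your exact logic):

from urllib.parse import urlparse
from scrapy import Request

class ScrapySpiderProjectMiddleware(object):

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the spider. Only keep
        # requests that stay on the response's own host; pass items through.
        domain = urlparse(response.url).netloc
        for item_or_request in result:
            if isinstance(item_or_request, Request):
                if urlparse(item_or_request.url).netloc == domain:
                    yield item_or_request
            else:
                yield item_or_request

    def process_start_requests(self, start_requests, spider):
        # Called with the spider's start requests; must yield only Requests.
        for request in start_requests:
            yield request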

Regarding python - Scrapy project middleware - TypeError: process_start_requests() takes 2 positional arguments but 3 were given, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43718657/
