gpt4 book ai didi

python - 从 Python 脚本向 Scrapy Spider 传递参数

转载 作者:太空狗 更新时间:2023-10-30 00:55:07 25 4
gpt4 key购买 nike

我只提到了我在发布此问题之前提到的一些问题(我目前没有链接到我在发布此问题之前提到的所有这些问题)-:

如果我不传递参数并从 BBSpider 类中请求用户输入(没有 main 函数 - 在 name="dmoz"行下方),我可以完全运行此代码,或者将它们作为预定义(即静态)参数提供。

我的密码是here .

我基本上是在尝试从 Python 脚本执行 Scrapy 蜘蛛,而不需要任何其他文件(甚至是设置文件)。这就是为什么我也在代码本身内指定了设置。

这是我执行此脚本时得到的输出:

http://bigbasket.com/ps/?q=apple
2015-06-26 12:12:34 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:12:34 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:12:34 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:12:35 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:12:35 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:12:35 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:12:35 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:12:35 [scrapy] INFO: Spider opened
2015-06-26 12:12:35 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:12:35 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:12:35 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 57, in _set_url
raise TypeError('Request url must be str or unicode, got %s:' % type(url).__name__)
TypeError: Request url must be str or unicode, got NoneType:
2015-06-26 12:12:35 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:12:35 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 342543),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 42, 35, 339158)}
2015-06-26 12:12:35 [scrapy] INFO: Spider closed (finished)

我目前面临的问题-:

  • 如果你仔细看我输出的第 1 行和第 6 行,我传递给蜘蛛的 start_url 被打印了两次,即使我在代码的第 31 行只写了一次打印语句(我给了它的链接多于)。为什么会发生这种情况,并且对于不同的值也是如此(第 1 行(我的输出)的初始打印语句输出给出了正确的结果,尽管第 6 行(我的输出)的打印语句输出?不仅如此,即使我write - 打印 'hi' - 然后它也会打印两次。为什么会这样?
  • 接下来,如果你看到我输出的这一行: TypeError:请求 url 必须是 str 或 unicode,得到 NoneType:为什么会这样(即使我上面发布的问题的链接写了同样的东西)?我不知道如何解决它?我什至尝试过 `self.start_urls=[str(kwargs.get('start_url'))]` - 然后它给出了以下输出:
http://bigbasket.com/ps/?q=apple
2015-06-26 12:28:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-26 12:28:00 [scrapy] INFO: Optional features available: ssl, http11
2015-06-26 12:28:00 [scrapy] INFO: Overridden settings: {}
2015-06-26 12:28:00 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
None
2015-06-26 12:28:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-26 12:28:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-26 12:28:01 [scrapy] INFO: Enabled item pipelines:
2015-06-26 12:28:01 [scrapy] INFO: Spider opened
2015-06-26 12:28:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-26 12:28:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-26 12:28:01 [scrapy] ERROR: Error while obtaining start requests
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/scrapy/core/engine.py", line 110, in _next_request
request = next(slot.start_requests)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 70, in start_requests
yield self.make_requests_from_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 73, in make_requests_from_url
return Request(url, dont_filter=True)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 24, in __init__
self._set_url(url)
File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 59, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: None
2015-06-26 12:28:01 [scrapy] INFO: Closing spider (finished)
2015-06-26 12:28:01 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 248350),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 6, 26, 6, 58, 1, 236056)}
2015-06-26 12:28:01 [scrapy] INFO: Spider closed (finished)

请帮我解决以上2个错误。

最佳答案

您需要在CrawlerProcesscrawl 方法上传递您的参数,因此您需要像这样运行它:

crawler = CrawlerProcess(Settings())
crawler.crawl(BBSpider, start_url=url)
crawler.start()

关于python - 从 Python 脚本向 Scrapy Spider 传递参数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31066959/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com