
python - Dynamic start_urls value

Reposted | Author: 太空宇宙 | Updated: 2023-11-04 06:02:10

I'm new to scrapy and python. I wrote a spider and it works fine with the start_urls value initialized on the class.

It also works fine if I hard-code a literal inside __init__:

self.start_urls = ['http://something.com']

However, when I read the values from a JSON file and build the list from them, I get the "Missing scheme" error shown below (note the Missing%20value in the URL).

I feel like I'm missing something obvious in scrapy or python, since I'm a newbie.

class SiteFeedConstructor(CrawlSpider, FeedConstructor):

    name = "Data_Feed"
    start_urls = ['http://www.cnn.com/']

    def __init__(self, *args, **kwargs):
        FeedConstructor.__init__(self, **kwargs)
        kwargs = {}
        super(SiteFeedConstructor, self).__init__(*args, **kwargs)

        self.name = str(self.config_json.get('name', 'Missing value'))
        self.start_urls = str(self.config_json.get('start_urls', 'Missing value'))
        self.start_urls = self.start_urls.split(",")

Error:

Traceback (most recent call last):
  File "/usr/bin/scrapy", line 4, in <module>
    execute()
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 132, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.7/dist-packages/scrapy/cmdline.py", line 139, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.7/dist-packages/scrapy/commands/runspider.py", line 64, in run
    self.crawler.crawl(spider)
  File "/usr/lib/python2.7/dist-packages/scrapy/crawler.py", line 42, in crawl
    requests = spider.start_requests()
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 55, in start_requests
    reqs.extend(arg_to_iter(self.make_requests_from_url(url)))
  File "/usr/lib/python2.7/dist-packages/scrapy/spider.py", line 59, in make_requests_from_url
    return Request(url, dont_filter=True)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 26, in __init__
    self._set_url(url)
  File "/usr/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 61, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: Missing%20value
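
The Missing%20value at the end of the traceback hints at the cause: the JSON config apparently has no start_urls key, so config_json.get() falls back to its default string 'Missing value', which Scrapy then URL-escapes and rejects because it has no scheme. A minimal sketch, assuming a hypothetical config dict without that key, reproduces the problem:

import json

# Hypothetical config that lacks a 'start_urls' key, so .get() returns its default.
config_json = json.loads('{"name": "Data_Feed"}')

start_urls = str(config_json.get('start_urls', 'Missing value')).split(",")
print(start_urls)  # ['Missing value'] -> Request('Missing value') raises
                   # ValueError: Missing scheme in request url: Missing%20value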

Best answer

Instead of defining __init__(), override the start_requests() method:

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it’s safe to implement it as a generator.

class SiteFeedConstructor(CrawlSpider, FeedConstructor):
    name = "Data_Feed"

    def start_requests(self):
        self.name = str(self.config_json.get('name', 'Missing value'))
        for url in str(self.config_json.get('start_urls', 'Missing value')).split(","):
            yield self.make_requests_from_url(url)
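
For completeness, here is a hypothetical sketch of how config_json could be supplied if FeedConstructor does not already provide it (FeedConstructor is not shown in the question): pass the JSON file's path as a spider argument, e.g. scrapy crawl Data_Feed -a config=feed.json. The spider name DataFeedSpider, the config argument, and feed.json are illustrative assumptions, not part of the original code.

import json

from scrapy.contrib.spiders import CrawlSpider  # Scrapy 0.x import path, matching the traceback


class DataFeedSpider(CrawlSpider):
    name = "Data_Feed"

    def __init__(self, config=None, *args, **kwargs):
        super(DataFeedSpider, self).__init__(*args, **kwargs)
        # Load the config from the path passed with -a config=...
        with open(config) as f:
            self.config_json = json.load(f)

    def start_requests(self):
        # split(",") always returns a list, so each comma-separated URL becomes its own Request
        for url in self.config_json.get('start_urls', '').split(","):
            if url:
                yield self.make_requests_from_url(url)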

Regarding python - dynamic start_urls value, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24253117/
