
python - Scrapy crawler in Python cannot follow links?


I wrote a crawler in Python using the Scrapy framework. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
#from scrapy.item import Item
from a11ypi.items import AYpiItem

class AYpiSpider(CrawlSpider):
    name = "AYpi"
    allowed_domains = ["a11y.in"]
    start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"]

    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    )

    def parse_item(self, response):
        #filename = response.url.split("/")[-1]
        #open(filename,'wb').write(response.body)
        #testing code ^ (the above)

        hxs = HtmlXPathSelector(response)
        item = AYpiItem()
        item["foruri"] = hxs.select("//@foruri").extract()
        item["thisurl"] = response.url
        item["thisid"] = hxs.select("//@foruri/../@id").extract()
        item["rec"] = hxs.select("//@foruri/../@rec").extract()
        return item
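
For reference, AYpiItem is imported from the project's a11ypi/items.py, which is not shown in the question. A minimal sketch of what it presumably contains, inferred from the fields assigned in parse_item (using the Scrapy 0.12-era item API):

# a11ypi/items.py -- minimal sketch; field names inferred from parse_item above
from scrapy.item import Item, Field

class AYpiItem(Item):
    foruri = Field()
    thisurl = Field()
    thisid = Field()
    rec = Field()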

However, it throws the following error:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
    q.append_spider_name(name, **opts.spargs)
--- <exception caught here> ---
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
    spider = self._spiders.create(name, **spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
    return self._spiders[spider_name](**spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
    self._compile_rules()
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
exceptions.TypeError: 'Rule' object is not iterable

Can someone explain what is going wrong here? This is exactly what the documentation shows, and since I left the allow field blank, follow should default to True. So why the error? Also, are there any optimizations I can make to speed my crawler up?

Best Answer

As far as I can tell, your rules attribute is not iterable. It looks like you were trying to make rules a tuple; you should read up on tuples in the Python documentation.

To fix the problem, change this:

rules = (
    Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
)

to:

rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),)

Notice the comma at the end? Without it, the parentheses merely group the expression, so rules is bound to a single Rule object rather than to a one-element tuple.
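
A quick interpreter session illustrates the distinction (a minimal sketch; the variable names are just for demonstration):

>>> no_comma = (1)        # parentheses only group here; this is just the int 1
>>> type(no_comma)
<type 'int'>
>>> with_comma = (1,)     # the trailing comma is what makes a one-element tuple
>>> type(with_comma)
<type 'tuple'>

CrawlSpider._compile_rules iterates over self.rules (self._rules = [copy.copy(r) for r in self.rules], per the traceback), so a bare Rule object raises exactly the TypeError shown above.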

Regarding "python - Scrapy crawler in Python cannot follow links?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/5223531/
