
python - How to run Scrapy from within a Python script


I'm new to Scrapy and I'm looking for a way to run it from a Python script. I found two sources that explain this:

http://tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/

http://snipplr.com/view/67006/using-scrapy-from-a-script/

I don't know where to put my spider code or how to call it from the main function. Please help. Here is the sample code:

# This snippet runs Scrapy spiders from a script, independent of scrapyd or the scrapy command-line tool.
#
# The multiprocessing library is used to work around a bug in Twisted: you cannot restart an already running reactor (or, in this case, a Scrapy instance).
#
# [Here](http://groups.google.com/group/scrapy-users/browse_thread/thread/f332fc5b749d401a) is the mailing-list discussion for this snippet.

#!/usr/bin/python
import os
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'project.settings')  # must be set before the other Scrapy imports

from scrapy import log, signals, project
from scrapy.xlib.pydispatch import dispatcher
from scrapy.conf import settings
from scrapy.crawler import CrawlerProcess
from multiprocessing import Process, Queue

class CrawlerScript():

    def __init__(self):
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        dispatcher.connect(self._item_passed, signals.item_passed)

    def _item_passed(self, item):
        self.items.append(item)

    def _crawl(self, queue, spider_name):
        spider = self.crawler.spiders.create(spider_name)
        if spider:
            self.crawler.queue.append_spider(spider)
        self.crawler.start()
        self.crawler.stop()
        queue.put(self.items)

    def crawl(self, spider):
        queue = Queue()
        p = Process(target=self._crawl, args=(queue, spider,))
        p.start()
        p.join()
        return queue.get(True)

# Usage
if __name__ == "__main__":
    log.start()

    """
    This example runs spider1 and then spider2 three times.
    """
    items = list()
    crawler = CrawlerScript()
    items.append(crawler.crawl('spider1'))
    for i in range(3):
        items.append(crawler.crawl('spider2'))
    print items

# Snippet imported from snippets.scrapy.org (which no longer works)
# author: joehillen
# date : Oct 24, 2010
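
For reference: the snippet above targets Scrapy 0.x (scrapy.conf, scrapy.xlib.pydispatch and scrapy.project were later removed). The same multiprocessing workaround can be sketched against the modern API roughly like this; it assumes the script runs inside a Scrapy project and that a spider named 'myspider' exists (both are placeholders, not part of the original question):

from multiprocessing import Process
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider(spider_name):
    # A fresh process gets a fresh Twisted reactor, so repeated
    # crawls never hit the "reactor not restartable" error.
    process = CrawlerProcess(get_project_settings())
    process.crawl(spider_name)  # look up the spider by name in the project
    process.start()             # blocks until the crawl finishes

if __name__ == '__main__':
    p = Process(target=run_spider, args=('myspider',))
    p.start()
    p.join()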

Thanks.

Best Answer

All of the other answers reference Scrapy v0.x. According to the updated docs, Scrapy 1.0 requires:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
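
If you need several crawls in one script (like the original example, which ran spider1 and then spider2 three times), the Scrapy docs suggest CrawlerRunner with Twisted deferreds instead of restarting the reactor. A minimal sketch for a recent Scrapy; the two spider classes and the example.com URL are hypothetical stand-ins:

import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class Spider1(scrapy.Spider):
    # Hypothetical stand-in for the original 'spider1'
    name = 'spider1'
    start_urls = ['http://example.com']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

class Spider2(Spider1):
    # Hypothetical stand-in for the original 'spider2'
    name = 'spider2'

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # One reactor, sequential crawls: no restart needed
    yield runner.crawl(Spider1)
    for _ in range(3):
        yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # blocks here until the last crawl finishes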

Regarding python - How to run Scrapy from within a Python script, there is a similar question on Stack Overflow: https://stackoverflow.com/questions/13437402/
