Python Scrapy: alternatives for ReactorNotRestartable

Reposted. Author: 太空狗. Updated: 2023-10-29 21:09:32

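I have been trying to build an application in Python using Scrapy that has the following functionality:

  • A REST API (which I built with Flask) listens for all crawl/scrape requests and returns the response once the crawl is done. (The crawl is short enough that the connection can be kept open until it finishes.)

I was able to do this with the following code: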

# Excerpt from the body of the Flask view function (hence the trailing return).
from scrapy import signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from twisted.internet import reactor

items = []

def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass, settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)

# Stop the reactor when the spider closes; without this, the code hangs at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)  #@UndefinedVariable
crawler.crawl(requestParams=requestParams)
# start crawling
reactor.run()  #@UndefinedVariable
return str(items)

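The problem I now face is that once the reactor has been stopped (which seems necessary to me, since I don't want to stay stuck at reactor.run()), I cannot accept any further requests after the first one. After the first request completes, I get the following error: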

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

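This is expected, because a Twisted reactor cannot be restarted once it has been stopped.

So my questions are:

1) How can I support the next crawl request?

2) Is there a way to move on to the line after reactor.run() without stopping the reactor?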

Best answer

Here is a simple solution to your problem:

from flask import Flask
import threading
import subprocess
import sys

app = Flask(__name__)


class myThread(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target  # label only; run() always calls start_crawl()

    def run(self):
        start_crawl()


def start_crawl():
    # Launch the crawler in a separate process, so every crawl gets a fresh reactor.
    subprocess.Popen([sys.executable, "start_request.py"])


@app.route("/crawler/start")
def start_req():
    print(":request")
    threadObj = myThread("run_crawler")
    threadObj.start()
    return "Your crawler is in running state"


if __name__ == "__main__":
    app.run(port=5000)

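In the solution above, I assume that you are able to start your crawler from a shell/command line by running the start_request.py file.

What we are doing here is using Python threads to start a new thread for each incoming request. You can now easily run crawler instances in parallel, one per hit. Just use threading.activeCount() (spelled threading.active_count() in current Python 3) to keep the number of threads under control.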

Regarding alternatives for Python Scrapy's ReactorNotRestartable, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39434406/
