
python - How to use APScheduler in Scrapy

Reposted. Author: 行者123. Updated: 2023-11-28 18:38:38

I have code for running a Scrapy crawler from a script (http://doc.scrapy.org/en/latest/topics/practices.html#run-scrapy-from-a-script), but it does not work.

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings


def run():
    spider = EgovSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the reactor as soon as the spider closes.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()


from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()

My spider:

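import scrapy
from datetime import datetime


class EgovSpider(scrapy.Spider):
    name = 'egov'
    start_urls = ['http://egov-buryatia.ru/index.php?id=1493']

    def parse(self, response):
        # Collect every text node from the rows of the news table.
        data = response.xpath("//div[@id='main_wrapper_content_news']//tr//text()").extract()
        print data
        print response.url
        # `now` was undefined in the original snippet; assumed to be the current time.
        now = datetime.now()
        # Append the scraped text and a timestamp to vac.txt.
        f = open("vac.txt", "a")
        for d in data:
            f.write(d.encode(encoding="UTF-8") + "\n")

        f.write(str(now))
        f.close()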

If I move the "reactor.run()" line out of run(), the spider starts only once, after 10 seconds:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings


def run():
    spider = EgovSpider()
    settings = get_project_settings()
    crawler = Crawler(settings)
    # Stop the reactor as soon as the spider closes.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()


from apscheduler.schedulers.twisted import TwistedScheduler
sched = TwistedScheduler()
sched.add_job(run, 'interval', seconds=10)
sched.start()
reactor.run()

I am not very experienced with Python or English :) Please help me.

Best answer

I ran into the same problem today. Here is some information.

A Twisted reactor cannot be restarted once it has run and stopped. You should start one long-running reactor and add crawl tasks to it periodically.

To simplify the code further, you can use CrawlerProcess.start(), which includes reactor.run().

from scrapy.crawler import CrawlerProcess
from spiders.egov import EgovSpider
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

process = CrawlerProcess(get_project_settings())
sched = TwistedScheduler()
sched.add_job(process.crawl, 'interval', args=[EgovSpider], seconds=10)
sched.start()
process.start(False) # Do not stop reactor after spider closes

For "python - How to use APScheduler in Scrapy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29765039/
