
python - How to pass parameters when using CrawlerRunner in Flask?

Reposted. Author: 行者123. Updated: 2023-11-28 16:25:51

I have read the official documentation of Scrapy 1.0.4 on how to run multiple spiders programmatically. It offers a way to do this with CrawlerRunner, so I use that in my Flask application. But there is a problem: I want to pass an argument to the crawler to become part of the start URLs, and I don't know how to do that. Here is my Flask application code:

from flask import Flask, redirect, url_for
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

app = Flask(__name__)

@app.route('/search_process', methods=['GET'])
def search():
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(EPGDspider)
    # runner.crawl(GDSpider)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()
    return redirect(url_for('details'))

And here is my spider code:

__author__ = 'Rabbit'
import scrapy
from scrapy.selector import Selector
from scrapy import Request
from scrapy import Item, Field


class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()


class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"
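The pagination loop at the end of parse can be distilled into a small helper: it scans the quickPage link hrefs for the "#" marker (the current page) and returns the href immediately after it, if any. A minimal sketch (the next_page name is ours, and it is written in Python 3 for illustration, unlike the Python 2 spider above):

```python
def next_page(hrefs):
    """Return the href following the current-page marker '#',
    or None when '#' is the last entry (i.e. we are on the last page)."""
    for i, href in enumerate(hrefs):
        if href == "#":
            if i + 1 < len(hrefs):
                return hrefs[i + 1]
            return None
    return None

print(next_page(["page1.jsp", "#", "page3.jsp"]))  # page3.jsp
print(next_page(["page1.jsp", "#"]))               # None
```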

As you can see, term is already hard-coded in the spider. I just want to pass the term parameter from the Flask app to my spider and compose the start URL dynamically. It is much like the situation in this question: How to pass a user defined argument in scrapy spider. But here everything is done programmatically inside the Flask app, not from the command line, and I don't know how to do it. Can anyone tell me how to handle this?
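Independent of how the argument reaches the spider, composing the dynamic start URL is plain string building. A minimal sketch (the build_start_url helper name is ours; quote_plus is an addition that guards against terms containing spaces or special characters):

```python
from urllib.parse import quote_plus

def build_start_url(term):
    """Compose the EPGD text-search start URL for a given search term."""
    base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp"
    return "%s?textquery=%s&submit=Feeling+Lucky" % (base, quote_plus(term))

print(build_start_url("man"))
# http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky
```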

Best Answer

I solved this with crawl(crawler_or_spidercls, *args, **kwargs); you can pass arguments to the spider through this method. Here is my Flask application code:

def search():
    configure_logging()
    runner = CrawlerRunner()
    runner.crawl(EPGDspider, term="man")
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()
And here is my spider code (you can override the __init__ method and build your dynamic start URLs):

    def __init__(self, term=None, *args, **kwargs):
        super(EPGDspider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=%s&submit=Feeling+Lucky' % term]
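Why this works: CrawlerRunner.crawl(crawler_or_spidercls, *args, **kwargs) forwards any extra arguments on to the spider's constructor. The toy sketch below illustrates just that forwarding with simplified stand-ins (MiniRunner and Spider here are not Scrapy's real classes):

```python
class Spider:
    """Simplified stand-in for scrapy.Spider (not the real class)."""
    name = None

    def __init__(self, *args, **kwargs):
        pass

class EPGDspider(Spider):
    name = "EPGD"

    def __init__(self, term=None, *args, **kwargs):
        super(EPGDspider, self).__init__(*args, **kwargs)
        self.start_urls = [
            'http://epgd.biosino.org/EPGD/search/textsearch.jsp'
            '?textquery=%s&submit=Feeling+Lucky' % term
        ]

class MiniRunner:
    """Toy stand-in: like CrawlerRunner.crawl, it forwards *args/**kwargs
    to the spider class when instantiating it."""
    def crawl(self, spidercls, *args, **kwargs):
        return spidercls(*args, **kwargs)

spider = MiniRunner().crawl(EPGDspider, term="man")
print(spider.start_urls[0])  # ...textquery=man&submit=Feeling+Lucky
```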

Regarding python - How to pass parameters when using CrawlerRunner in Flask?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36740947/
