python - How to pass arguments between two spiders using Scrapy callbacks

Reposted. Author: 行者123. Updated: 2023-12-01 09:15:01
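I have two Scrapy spiders. The first crawls the sitemap, extracts the URLs, and writes them to a txt file; the second reads that file line by line and crawls each URL.

My code is as follows: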

from scrapy.spiders import SitemapSpider


class sitemapSpider(SitemapSpider):
    name = "filmnetmapSpider"
    sitemap_urls = ['http://filmnet.ir/sitemap.xml']
    sitemap_rules = [
        ('/series/', 'parse_item')
    ]
    storage_file = 'urls.txt'

    def parse_item(self, response):
        videoid = response.url

        # Append the matched URL to the file read by the second spider.
        with open(self.storage_file, 'a') as handle:
            handle.write(videoid + '\n')

The second spider:

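import ast
import json

import scrapy
from scrapy import Request
from scrapy.selector import HtmlXPathSelector  # legacy class from older Scrapy releases


class filmnetSpider(scrapy.Spider):
    name = 'filmnetSpider'

    def start_requests(self):
        # Request every URL that the sitemap spider wrote to urls.txt.
        with open('urls.txt') as fp:
            for line in fp:
                yield Request(line.strip(), callback=self.parse_website)

    def parse_website(self, response):
        hxs = HtmlXPathSelector(response)
        url = hxs.xpath('//script[@type="application/ld+json"]/text()').extract()
        url = ast.literal_eval(json.dumps(url))
        url = url[1]  # second JSON-LD block on the page
        obj = json.loads(url)
        poster = obj['image']
        name = obj['name']
        description = obj['description']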

How can I change the code to remove the file read/write?

How can I use a callback for this?

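Note: this does not work when everything is combined into one Scrapy script. The script consists of the two spiders above plus the code below, as described in the docs.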

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(filmnetSpider)
process.crawl(sitemapSpider)
process.start()

Best answer

This should work:

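import ast
import json

from scrapy import Request
from scrapy.spiders import SitemapSpider


class sitemapSpider(SitemapSpider):
    name = "filmnetmapSpider"
    sitemap_urls = ['http://filmnet.ir/sitemap.xml']
    sitemap_rules = [
        ('/series/', 'parse_item')
    ]

    def parse_item(self, response):
        videoid = response.url
        # This URL was just downloaded, so the duplicate filter would drop
        # a second request for it unless dont_filter=True is set.
        yield Request(videoid, callback=self.parse_website, dont_filter=True)

    def parse_website(self, response):
        # response.xpath replaces the legacy HtmlXPathSelector wrapper.
        url = response.xpath('//script[@type="application/ld+json"]/text()').extract()
        url = ast.literal_eval(json.dumps(url))
        url = url[1]  # second JSON-LD block on the page
        obj = json.loads(url)
        poster = obj['image']
        name = obj['name']
        description = obj['description']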

Regarding "python - How to pass arguments between two spiders using Scrapy callbacks", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/51346410/
