
python - Running scrapy tasks in a loop

Reposted · Author: 太空宇宙 · Updated: 2023-11-04 10:17:42

I have this code:

    from logging import INFO

    import scrapy

    class LinkedInAnonymousSpider(scrapy.Spider):
        name = "linkedin_anonymous"
        allowed_domains = ["linkedin.com"]
        start_urls = []

        base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

        def __init__(self, input=None, first=None, last=None):
            self.input = input  # source file name
            self.first = first
            self.last = last

        def start_requests(self):
            if self.first and self.last:  # taking input from command line parameters
                url = self.base_url % (self.first, self.last)
                yield self.make_requests_from_url(url)
            elif self.input:  # taking input from file
                i = 0
                self.log('Input from file: %s' % self.input, INFO)
                for line in open(self.input, 'r').readlines():
                    i += 1
                    if line.strip():  # skip blank lines
                        t = line.split("\t")
                        name = t[0]
                        parts = [n.strip() for n in name.split(' ')]
                        last = parts.pop()
                        first = " ".join(parts)

                        if first and last:
                            url = self.base_url % (first, last)
                            yield self.make_requests_from_url(url)
            else:
                raise Exception('No input.')

        def parse(self, response):
            # if there is exactly one match, the person's profile page is returned
            if response.xpath('//div[@class="profile-overview-content"]').extract():
                yield scrapy.Request(response.url, callback=self.parse_full_profile_page)
            else:
                # extracting profile urls from the search result
                for sel in response.css('div.profile-card'):
                    url = sel.xpath('./*/h3/a/@href').extract()[0]  # person's full profile URL on LinkedIn
                    yield scrapy.Request(url, callback=self.parse_full_profile_page)
        ...

With this code I can fetch the profile details for a list of people from LinkedIn.

To do that, I wrote a main routine like this:

    import scrapy
    import sys

    from linkedin_anonymous_spider import LinkedInAnonymousSpider
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from twisted.internet import reactor

    if __name__ == "__main__":
        firstname = ['Hasan', 'James']
        lastname = ['Arslan', 'Bond']
        for a in range(len(firstname)):
            settings = get_project_settings()
            crawler = CrawlerProcess(settings)
            spider = LinkedInAnonymousSpider()
            crawler.crawl(spider, [], firstname[a], lastname[a])
            crawler.start()

When the loop reaches its second iteration, I get this error:

    raise error.ReactorNotRestartable()
    twisted.internet.error.ReactorNotRestartable

How can I fix this?

Thanks.

Best answer

You can only run one Twisted reactor per process, so crawler.start() may be called only once.

Try moving crawler.start() outside the loop: schedule all the crawls first, then start the reactor a single time.
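A minimal sketch of that fix, assuming the spider module from the question is importable. CrawlerProcess.crawl() only queues a crawl; nothing runs until start(), so every pair of names can be scheduled on one CrawlerProcess before the single reactor is started:

```python
def main():
    # Scrapy imports are kept inside the function so the sketch reads standalone.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    from linkedin_anonymous_spider import LinkedInAnonymousSpider

    firstnames = ['Hasan', 'James']
    lastnames = ['Arslan', 'Bond']

    process = CrawlerProcess(get_project_settings())

    # Schedule every crawl first; nothing runs yet.
    for first, last in zip(firstnames, lastnames):
        process.crawl(LinkedInAnonymousSpider, input=None, first=first, last=last)

    # Start the (single) reactor exactly once; it stops when all crawls finish.
    process.start()

if __name__ == "__main__":
    main()
```

Note that crawl() is passed the spider class plus keyword arguments rather than a pre-built instance; Scrapy constructs the spider itself. The names and argument values here simply mirror the question's code.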

Regarding "python - running scrapy tasks in a loop", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/34394753/
