
python - Scrapy spider stops running after getting results from the first city in the list


I built a scraper to run through a job site and save all of the potential job data to a CSV file and then to my MySQL database. For some reason, the scraper stops running after pulling jobs from the first city in the list. Here's what I mean:

City list code:

Cities = {
    'cities': ['washingtondc',
               'newyork',
               'sanfrancisco',
               '...',
               '...']
}

Scrapy spider code:

# -*- coding: utf-8 -*-
from city_list import Cities
import scrapy, os, csv, glob, pymysql.cursors

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    c_list = Cities['cities']
    for c in c_list:
        print(f'Searching {c} for jobs...')
        allowed_domains = [f'{c}.jobsite.com']
        start_urls = [f'https://{c}.jobsite.com/search/jobs/']

    def parse(self, response):
        listings = response.xpath('//li[@class="listings-path"]')
        for listing in listings:
            date = listing.xpath('.//*[@class="date-path"]/@datetime').extract_first()
            link = listing.xpath('.//a[@class="link-path"]/@href').extract_first()
            text = listing.xpath('.//a[@class="text-path"]/text()').extract_first()

            yield scrapy.Request(link,
                                 callback=self.parse_listing,
                                 meta={'date': date,
                                       'link': link,
                                       'text': text})

        next_page_url = response.xpath('//a[text()="next-path "]/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        date = response.meta['date']
        link = response.meta['link']
        text = response.meta['text']
        compensation = response.xpath('//*[@class="compensation-path"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//*[@class="employment-type-path"]/span[2]/b/text()').extract_first()
        images = response.xpath('//*[@id="images-path"]//@src').extract()
        address = response.xpath('//*[@id="address-path"]/text()').extract()

        yield {'date': date,
               'link': link,
               'text': text,
               'compensation': compensation,
               'type': employment_type,
               'images': images,
               'address': address}

    def close(self, reason):
        csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)

        conn = pymysql.connect(host='localhost',
                               user='root',
                               password='**********',
                               db='jobs_database',
                               charset='utf8mb4',
                               cursorclass=pymysql.cursors.DictCursor)

        cur = conn.cursor()
        csv_data = csv.reader(open('jobs.csv'))

        for row in csv_data:
            cur.execute('INSERT INTO jobs_table(date, link, text, compensation, type, images, address)'
                        'VALUES(%s, %s, %s, %s, %s, %s, %s)', row)

        conn.commit()
        conn.close()
        print("Done Importing!")

The scraper works fine, but it stops running after pulling jobs from washingtondc and then exits.

How do I fix this?

UPDATE - I changed the code above to:

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(self, *args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            print(f'Searching {c} for jobs...')
            self.allowed_domains.append(f'{c}.jobsearch.com')
            self.start_urls.append(f'https://{c}.jobsearch.com/search/jobs/')

    def parse(self, response):
        ...

and now I'm getting "RecursionError: maximum recursion depth exceeded while calling a Python object".

Here is the traceback:

Traceback (most recent call last):
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 1034, in emit
    msg = self.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 880, in format
    return fmt.format(record)
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 619, in format
    record.message = record.getMessage()
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/logging/__init__.py", line 380, in getMessage
    msg = msg % self.args
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  File "/usr/local/Cellar/python/3.7.2_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiders/__init__.py", line 107, in __str__
    return "<%s %r at 0x%0x>" % (type(self).__name__, self.name, id(self))
  [Previous line repeated 479 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object

Best Answer

The first problem is that your spider variables and methods are inside the for loop. Instead, you need to set those member variables in __init__(). Without testing the rest of your logic, here is a rough idea of what you need to do:

class JobsSpider(scrapy.Spider):
    name = 'jobs'
    # Don't do the for loop here.
    allowed_domains = []
    start_urls = []

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        c_list = Cities['cities']
        for c in c_list:
            self.allowed_domains.append(f'{c}.jobsite.com')
            self.start_urls.append(f'https://{c}.jobsite.com/search/jobs/')

    def parse(self, response):
        # ...

If you still have problems after this, update your question and I'll try to update the answer.
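A small side note on the sketch above: allowed_domains and start_urls are declared as class-level lists, so calling append() in __init__() mutates the shared class attributes rather than building per-instance ones. With a single spider instance this is harmless, but an equivalent variant that assigns fresh lists on the instance (using the same Cities dict from the question) would look roughly like this:

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Assign new lists to the instance instead of appending to the
        # class-level ones, so repeated instantiation doesn't accumulate entries.
        self.allowed_domains = [f'{c}.jobsite.com' for c in Cities['cities']]
        self.start_urls = [f'https://{c}.jobsite.com/search/jobs/' for c in Cities['cities']]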


An explanation of what went wrong: when you have a for loop in the class body, as in your question, it ends up overwriting the variables and functions. Here is an example straight from the Python shell:

>>> class SomeClass:
...     for i in range(3):
...         print(i)
...         value = i
...         def get_value(self):
...             print(self.value)
...
0
1
2
>>> x = SomeClass()
>>> x.value
2
>>> x.get_value()
2

Basically, the for loop is executed before you ever use the class. So rather than running the function multiple times, it ends up redefining the function multiple times. The end result is that your functions and variables point to whatever was set last.
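As for the RecursionError from the update: it most likely comes from calling super().__init__(self, *args, **kwargs). Passing self explicitly makes it the first positional argument, which Scrapy's Spider.__init__() binds to its name parameter, so the spider becomes its own name; when Scrapy then formats the spider for a log message, __str__ recurses on itself, which is exactly what the repeated frames in the traceback show. Dropping the extra self, as in the sketch above, avoids it:

import scrapy

class JobsSpider(scrapy.Spider):
    name = 'jobs'

    def __init__(self, *args, **kwargs):
        # Buggy: super().__init__(self, *args, **kwargs)
        # The extra `self` is bound to Spider.__init__'s first parameter (name),
        # so self.name ends up being the spider instance itself and logging it
        # recurses through __str__ until the recursion limit is hit.

        # Correct: Python already passes the instance implicitly.
        super().__init__(*args, **kwargs)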

For this question about python - Scrapy spider stops running after getting results from the first city in the list, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55192920/
