
python - Making separate output files for every category in scrapy

Reposted. Author: 行者123. Updated: 2023-12-03 16:49:36

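I am trying to scrape Yellow Pages by category. I load the categories from a text file and feed them to start_urls. The problem I am facing is saving the output separately for each category. Below is the code I tried to implement: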

CATEGORIES = []
with open('Catergories.txt', 'r') as f:
    data = f.readlines()

for category in data:
    CATEGORIES.append(category.strip())
This opens the file in settings.py and builds a list that the spider can access.
Spider:
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import YellowItem
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class YpSpider(CrawlSpider):
    categories = settings.get('CATEGORIES')

    name = 'yp'
    allowed_domains = ['yellowpages.com']

    start_urls = ['https://www.yellowpages.com/search?search_terms={0}&geo_location_terms=New%20York'
                  '%2C '
                  '%20NY'.format(*categories)]
    rules = (

        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]', allow=''), callback='parse_item',
             follow=True),

        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]', allow=''),
             follow=True),
    )

    def parse_item(self, response):
        categories = settings.get('CATEGORIES')
        print(categories)
        item = YellowItem()
        # for data in response.xpath('//section[@class="info"]'):
        item['title'] = response.xpath('//h1/text()').extract_first()
        item['phone'] = response.xpath('//p[@class="phone"]/text()').extract_first()
        item['street_address'] = response.xpath('//h2[@class="address"]/text()').extract_first()
        email = response.xpath('//a[@class="email-business"]/@href').extract_first()
        try:
            item['email'] = email.replace("mailto:", '')
        except AttributeError:
            pass
        item['website'] = response.xpath('//a[@class="primary-btn website-link"]/@href').extract_first()
        item['Description'] = response.xpath('//dd[@class="general-info"]/text()').extract_first()
        item['Hours'] = response.xpath('//div[@class="open-details"]/descendant-or-self::*/text()[not(ancestor::*['
                                       '@class="hour-category"])]').extract()
        item['Other_info'] = response.xpath(
            '//dd[@class="other-information"]/descendant-or-self::*/text()').extract()
        category_ha = response.xpath('//dd[@class="categories"]/descendant-or-self::*/text()').extract()
        item['Categories'] = " ".join(category_ha)
        item['Years_in_business'] = response.xpath('//div[@class="number"]/text()').extract_first()
        neighborhood = response.xpath('//dd[@class="neighborhoods"]/descendant-or-self::*/text()').extract()
        item['neighborhoods'] = ' '.join(neighborhood)
        item['other_links'] = response.xpath('//dd[@class="weblinks"]/descendant-or-self::*/text()').extract()

        item['category'] = '{0}'.format(*categories)

        return item
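
A side note on the spider: `'...{0}...'.format(*categories)` fills `{0}` with the first category only, so `start_urls` ends up holding a single URL, and `item['category']` is always that first category. The following is a minimal sketch of building one search URL per category using only the standard library. The names `build_start_urls`, `BASE_URL` and the `location` parameter are hypothetical helpers introduced for illustration, not part of the original code or of Scrapy.

```python
from urllib.parse import quote, urlencode

# Search endpoint taken from the start_urls string in the question.
BASE_URL = 'https://www.yellowpages.com/search'


def build_start_urls(categories, location='New York, NY'):
    """Return one search URL per category (hypothetical helper)."""
    urls = []
    for category in categories:
        # quote_via=quote encodes spaces as %20 rather than '+',
        # matching the style of the URL used in the question.
        query = urlencode(
            {'search_terms': category, 'geo_location_terms': location},
            quote_via=quote,
        )
        urls.append('{}?{}'.format(BASE_URL, query))
    return urls


if __name__ == '__main__':
    for url in build_start_urls(['Pizza', 'Auto Repair']):
        print(url)
```

In the spider this could presumably be used as `start_urls = build_start_urls(categories)`. Carrying the category from the search page through to `parse_item` would still need extra work, which this sketch does not attempt.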


This is the pipelines.py file:
from scrapy import signals
from scrapy.exporters import CsvItemExporter
from scrapy.utils.project import get_project_settings

settings = get_project_settings()


class YellowPipeline(object):
    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        self.exporters = {}
        categories = settings.get('CATEGORIES')

        file = open('{0}.csv'.format(*categories), 'w+b')

        exporter = CsvItemExporter(file, encoding='cp1252')
        exporter.fields_to_export = ['title', 'phone', 'street_address', 'website', 'email', 'Description',
                                     'Hours', 'Other_info', 'Categories', 'Years_in_business', 'neighborhoods',
                                     'other_links']
        exporter.start_exporting()
        for category in categories:
            self.exporters[category] = exporter

    def spider_closed(self, spider):

        for exporter in iter(self.exporters.items()):
            exporter.finish_exporting()

    def process_item(self, item, spider):

        self.exporters[item['category']].export_item(item)
        return item
After running the code, I get the following error:
exporter.finish_exporting()
AttributeError: 'tuple' object has no attribute 'finish_exporting'
I need a separate csv file for each category. Any help would be appreciated.

Best answer

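I would do this in post-processing. Export all items to a single .csv file that includes a category field. I think you are not thinking about this the right way and are overcomplicating it. Not sure if this works, but it is worth a try :)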

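import csv

# Split parent.csv into one file per value of its 'category' column.
with open('parent.csv', 'r', newline='') as file:
    reader = csv.DictReader(file)
    for row in reader:
        # Append the row to the file named after its category (no header row is written).
        with open('{}.csv'.format(row['category']), 'a', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=reader.fieldnames)
            writer.writerow(row)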
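
A slightly fuller sketch of the same post-processing idea, in case it helps: it opens each per-category file once, writes a header row, and reports how many rows went to each file. The function name `split_by_category` and its parameters are hypothetical, and it assumes the parent file has a header row containing a `category` column and that category names are safe to use as file names.

```python
import csv
import os


def split_by_category(parent_path, out_dir, field='category'):
    """Write one CSV per distinct value of `field`; return {value: row_count}."""
    os.makedirs(out_dir, exist_ok=True)
    handles, writers, counts = {}, {}, {}
    try:
        with open(parent_path, newline='', encoding='utf-8') as src:
            reader = csv.DictReader(src)
            for row in reader:
                key = row[field]
                if key not in writers:
                    # First row for this category: create its file and header.
                    path = os.path.join(out_dir, '{}.csv'.format(key))
                    handles[key] = open(path, 'w', newline='', encoding='utf-8')
                    writers[key] = csv.DictWriter(
                        handles[key], fieldnames=reader.fieldnames)
                    writers[key].writeheader()
                    counts[key] = 0
                writers[key].writerow(row)
                counts[key] += 1
    finally:
        # Close every per-category file, even if reading failed part-way.
        for handle in handles.values():
            handle.close()
    return counts
```

Called as `split_by_category('parent.csv', 'by_category')`, this would leave files such as `by_category/<category>.csv`, overwriting any earlier output rather than appending to it.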

You can also run this code from the spider_closed signal.

https://docs.scrapy.org/en/latest/topics/signals.html#scrapy.signals.spider_closed
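
For completeness, the AttributeError in the question comes from iterating `self.exporters.items()`, which yields `(key, value)` tuples; iterating `.values()` yields the exporters themselves. If writing the files during the crawl is still preferred, a pipeline along these lines might work. It is only a sketch under stated assumptions: it swaps Scrapy's `CsvItemExporter` for the standard-library `csv.DictWriter` so that it runs without Scrapy installed, it relies on the `open_spider`, `process_item` and `close_spider` pipeline hooks, it assumes every item carries a correct `category` value, and the class name and `out_dir` parameter are hypothetical.

```python
import csv
import os


class PerCategoryCsvPipeline:
    """Sketch of a pipeline that writes one CSV file per item['category']."""

    # Same columns as fields_to_export in the question.
    fields = ['title', 'phone', 'street_address', 'website', 'email',
              'Description', 'Hours', 'Other_info', 'Categories',
              'Years_in_business', 'neighborhoods', 'other_links']

    def __init__(self, out_dir='.'):
        self.out_dir = out_dir  # hypothetical parameter, not a Scrapy setting
        self.files = {}
        self.writers = {}

    def open_spider(self, spider):
        os.makedirs(self.out_dir, exist_ok=True)

    def process_item(self, item, spider):
        category = item['category']
        if category not in self.writers:
            # Lazily create the file and writer the first time a category appears.
            path = os.path.join(self.out_dir, '{}.csv'.format(category))
            self.files[category] = open(path, 'w', newline='', encoding='utf-8')
            writer = csv.DictWriter(self.files[category], fieldnames=self.fields,
                                    restval='', extrasaction='ignore')
            writer.writeheader()
            self.writers[category] = writer
        # Missing fields become '', fields outside `fields` are skipped.
        self.writers[category].writerow(dict(item))
        return item

    def close_spider(self, spider):
        # .values() yields the file objects; .items() would yield tuples.
        for handle in self.files.values():
            handle.close()
```

Creating the writers lazily avoids needing the category list up front, at the cost of producing no file at all for a category that yields no items.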

Regarding "python - Making separate output files for every category in scrapy", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/60625950/
