
python-2.7 - Scrapy from a script: won't export data


I am trying to run Scrapy from a script, but I cannot get the program to create the export file.

I tried to get the file to export in two different ways:

  • With a pipeline
  • With the feed exporter

  • Both ways work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.

    I am not alone in having this problem. Here are two other similar, unanswered questions. I did not notice these until after I posted my question.
  • JSON not working in scrapy when calling spider through a python script?
  • Calling scrapy from a python script not creating JSON output file

  • Here is the code I use to run Scrapy from a script. It includes the settings for producing the output file with both the pipeline and the feed exporter.
    from twisted.internet import reactor

    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.xlib.pydispatch import dispatcher
    import logging

    from external_links.spiders.test import MySpider
    from scrapy.utils.project import get_project_settings
    settings = get_project_settings()

    #manually set settings here
    settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
    settings.set('DEPTH_LIMIT',1,priority='cmdline')
    settings.set('LOG_FILE','Log.log',priority='cmdline')
    settings.set('FEED_URI','output.csv',priority='cmdline')
    settings.set('FEED_FORMAT', 'csv',priority='cmdline')
    settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
    settings.set('FEED_STORE_EMPTY',True,priority='cmdline')

    def stop_reactor():
        reactor.stop()

    dispatcher.connect(stop_reactor, signal=signals.spider_closed)
    spider = MySpider()
    crawler = Crawler(settings)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start(loglevel=logging.DEBUG)
    log.msg('reactor running...')
    reactor.run()
    log.msg('Reactor stopped...')

    After I run this code, the log says: "Stored csv feed (341 items) in: output.csv", but no output.csv is to be found.
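    One thing worth checking when the log claims the feed was stored but the file is nowhere to be found: a relative FEED_URI such as 'output.csv' is created relative to the process's current working directory, which can differ from the project directory when the script is launched from elsewhere. A quick standalone check (plain Python, no Scrapy required; the print is only illustrative):

    ```python
    import os

    # A relative feed path such as 'output.csv' ends up relative to the
    # current working directory of the process, not the script's folder.
    feed_uri = 'output.csv'
    resolved = os.path.join(os.getcwd(), feed_uri)
    print('A relative feed would be written here:', resolved)
    ```

    If the file turns up in an unexpected directory, an absolute URI such as file:///tmp/feeds/output.csv removes the ambiguity.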

    Here is my feed exporter code:
    settings = get_project_settings()

    #manually set settings here
    settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
    settings.set('DEPTH_LIMIT',1,priority='cmdline')
    settings.set('LOG_FILE','Log.log',priority='cmdline')
    settings.set('FEED_URI','output.csv',priority='cmdline')
    settings.set('FEED_FORMAT', 'csv',priority='cmdline')
    settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
    settings.set('FEED_STORE_EMPTY',True,priority='cmdline')


    from scrapy.contrib.exporter import CsvItemExporter


    class CsvOptionRespectingItemExporter(CsvItemExporter):

        def __init__(self, *args, **kwargs):
            delimiter = settings.get('CSV_DELIMITER', ',')
            kwargs['delimiter'] = delimiter
            super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)

    Here is my pipeline code:
    import csv

    class CsvWriterPipeline(object):

        def __init__(self):
            self.csvwriter = csv.writer(open('items2.csv', 'wb'))

        def process_item(self, item, spider):  # item needs to be second in this list, otherwise we get the spider object
            self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])

            return item
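    A possible reason items2.csv ends up missing or empty even though process_item runs: the file handle opened in __init__ is never closed, so buffered rows can be lost when the reactor is stopped. Here is a hedged sketch of the same pipeline using Scrapy's open_spider/close_spider pipeline hooks to close the file explicitly (file mode 'w' here instead of the Python 2 'wb'; untested against the question's project):

    ```python
    import csv

    class ClosingCsvWriterPipeline(object):
        """Variant of CsvWriterPipeline that closes its file explicitly."""

        def open_spider(self, spider):
            # Called by Scrapy when the spider opens.
            self.csvfile = open('items2.csv', 'w')
            self.csvwriter = csv.writer(self.csvfile)

        def close_spider(self, spider):
            # Called when the spider closes: flush buffered rows to disk.
            self.csvfile.close()

        def process_item(self, item, spider):
            self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
            return item
    ```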

    Best Answer

    I had the same problem.

    Here is what worked for me:

  • Put the export URI in settings.py: FEED_URI = 'file:///tmp/feeds/filename.jsonlines'
  • Create a scrape.py script next to your scrapy.cfg with the following contents:
     
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings


    process = CrawlerProcess(get_project_settings())

    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
    process.start() # the script will block here until the crawling is finished

  • Run: python scrape.py

  • Result: the file is created.
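    The first step above amounts to a short addition to the project's settings.py (the URI is the one from the answer; FEED_FORMAT = 'jsonlines' is an assumption added here to match the file extension):

    ```python
    # settings.py: feed export configuration (URI taken from the answer above)
    FEED_URI = 'file:///tmp/feeds/filename.jsonlines'
    FEED_FORMAT = 'jsonlines'  # assumed, to match the .jsonlines extension
    ```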

    Note: I have no pipelines in my project, so I am not sure whether a pipeline would filter your results.

    Also: here is the common pitfalls section of the docs, which helped me.

    Regarding python-2.7 - Scrapy from a script: won't export data, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27573265/
