python - Empty output file when crawling

Reposted by: 太空宇宙. Updated: 2023-11-03 14:56:11

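I know I have already asked a similar question, but this is a new spider and I have the same problem (Crawling data successfully but cannot scraped or write it into csv)... I am posting my other spider here, together with a sample of the output I get and all the information that is usually needed to produce the output file... Can someone help me? I have to finish this spider by Friday, so I am in a hurry!

The strange thing is that my Fnac.csv is created but is always empty. I tried running my spider directly on an example of the page I want to scrape, and there I get all the information I need, so I don't understand. Maybe the problem just comes from my rules, or something else?

My spider: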

# -*- coding: utf-8 -*-
# Every import is done for a specific use
import scrapy                                       # Once you downloaded scrapy, you have to import it in your code to use it.
import re                                           # To use the .re() function, which extracts just a part of the text you crawl. It's using regex (regular expressions)
import numbers                                      # To use mathematics things, in this case : numbers.
from fnac.items import FnacItem                     # To return the items you want. Each item has a space allocated in the memory, created in the items.py file, which is in the second cdiscount_test directory.
from urllib.request import urlopen                  # To use urlopen, which allow the spider to find the links in a page that is in the actual page.
from scrapy.spiders import CrawlSpider, Rule        # To use rules and LinkExtractor, which allowed the spider to follow every url on the page you crawl.
from scrapy.linkextractors import LinkExtractor     # Look above.
from bs4 import BeautifulSoup                       # To crawl an iframe, which is a page in a page in web programming.

# Your spider
class Fnac(CrawlSpider):
    name = 'FnacCom'                             # Name of your spider. You call it in the anaconda prompt.
    allowed_domains = ['fnac.com']               # Web domains allowed by you, your spider cannot enter on a page which is not in that domain.
    start_urls = ['https://www.fnac.com/Index-Vendeurs-MarketPlace/A/']        # The first link you crawl.

    # To allow your spider to follow the urls that are on the actual page.
    rules = (
        Rule(LinkExtractor(), callback='parse_start_url'),
    )

    # Your function that crawl the actual page you're on.
    def parse_start_url(self, response):
        item = FnacItem() # The spider now knows that the items you want have to be stored in the item variable.

        # First data you want which are on the actual page.
        nb_sales = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/span/text()').re(r'([\d]*) ventes')
        country = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/text()').re(r'([A-Z].*)')

        # To store the data in their right places.
        item['nb_sales'] = ''.join(nb_sales).strip()
        item['country'] = ''.join(country).strip()

        # Find a specific link on the actual page and launch this function on it. It's the place where you will find your two first data.
        test_list = response.xpath('//a/@href')
        for test_list in response.xpath('.//div[@class="ProductPriceBox-item detail"]'):
            temporary = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
            for i in range(len(temporary)):
                scrapy.Request(temporary[i], callback=self.parse_start_url, meta={'dont_redirect': True, 'item': item})

        # To find the iframe on a page, launch the next function.
        yield scrapy.Request(response.url, callback=self.parse_iframe, meta={'dont_redirect': True, 'item': item})

    # Your function that crawl the iframe on a page
    def parse_iframe(self, response):
        f_item1 = response.meta['item'] # Just to use the same item location you used above.

        # Find all the iframe on a page.
        soup = BeautifulSoup(urlopen(response.url), "lxml")
        iframexx = soup.find_all('iframe')

        # If there's at least one iframe, launch the next function on it
        if (len(iframexx) != 0):
            for iframe in iframexx:
                yield scrapy.Request(iframe.attrs['src'], callback=self.extract_or_loop, meta={'dont_redirect': True, 'item': f_item1})

        # If there's no iframe, launch the next function on the link of the page where you looked after the potential iframe.
        else:
            yield scrapy.Request(response.url, callback=self.extract_or_loop, meta={'dont_redirect': True, 'item': f_item1})

    # Function to find the other data.
    def extract_or_loop(self, response):
        f_item2 = response.meta['item'] # Just to use the same item location you used above.

        # The rest of the data you want.
        address = response.xpath('//body//div/p/text()').re(r'.*Adresse \: (.*)\n?.*')
        email = response.xpath('//body//div/ul/li[contains(text(),"@")]/text()').extract()
        name = response.xpath('//body//div/p[@class="customer-policy-label"]/text()').re(r'Infos sur la boutique \: ([a-zA-Z0-9]*\s*)')
        phone = response.xpath('//body//div/p/text()').re(r'.*Tél \: ([\d]*)\n?.*')
        siret = response.xpath('//body//div/p/text()').re(r'.*Siret \: ([\d]*)\n?.*')
        vat = response.xpath('//body//div/text()').re(r'.*TVA \: (.*)')

        # If the name of the seller exist, then return the data.
        if (len(name) != 0):
            f_item2['name'] = ''.join(name).strip()
            f_item2['address'] = ''.join(address).strip()
            f_item2['phone'] = ''.join(phone).strip()
            f_item2['email'] = ''.join(email).strip()
            f_item2['vat'] = ''.join(vat).strip()
            f_item2['siret'] = ''.join(siret).strip()
            yield f_item2

        # If not, there was no data on the page and you have to find all the links on your page and launch the first function on them.
        else:
            for sel in response.xpath('//html/body'):
                list_urls = sel.xpath('//a/@href').extract()
                list_iframe = response.xpath('//div[@class="ProductPriceBox-item detail"]/div/a/@href').extract()
                if (len(list_iframe) != 0):
                    for list_iframe in list_urls:
                        yield scrapy.Request(list_iframe, callback=self.parse_start_url, meta={'dont_redirect': True})
                for url in list_urls:
                    yield scrapy.Request(response.urljoin(url), callback=self.parse_start_url, meta={'dont_redirect': True})
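
A note on the spider above: inside `parse_start_url`, the `scrapy.Request` for each seller link is constructed but never yielded, and Scrapy only schedules requests that a callback returns or yields. The sketch below illustrates that difference with plain Python generators. `FakeRequest` and both callback names are hypothetical stand-ins and not part of Scrapy, and this may not be the only reason the crawl scrapes 0 items.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request, used only for illustration."""
    url: str


def callback_without_yield(links):
    # Mirrors the loop in parse_start_url: the object is built, then discarded.
    for link in links:
        FakeRequest(link)
    yield FakeRequest("https://example.com/final")


def callback_with_yield(links):
    # Yielding hands every request back to the caller (the engine, in Scrapy).
    for link in links:
        yield FakeRequest(link)
    yield FakeRequest("https://example.com/final")


links = ["https://example.com/a", "https://example.com/b"]
lost = list(callback_without_yield(links))
kept = list(callback_with_yield(links))
print(len(lost), len(kept))
```

Only the yielded objects reach the caller, so the first callback hands back a single request while the second hands back all three.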

My settings:

BOT_NAME = 'fnac'

SPIDER_MODULES = ['fnac.spiders']
NEWSPIDER_MODULE = 'fnac.spiders'
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = False
ITEM_PIPELINES = {
   'fnac.pipelines.FnacPipeline': 300,
}

My pipeline:

# -*- coding: utf-8 -*-
from scrapy import signals
from scrapy.exporters import CsvItemExporter

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Define your output file.
class FnacPipeline(CsvItemExporter):
    def __init__(self):
        self.files = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        f = open('..\\..\\..\\..\\Fnac.csv', 'w').close()
        file = open('..\\..\\..\\..\\Fnac.csv', 'wb')
        self.files[spider] = file
        self.exporter = CsvItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
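
Because the pipeline above opens Fnac.csv as soon as the spider starts, the file exists even when no item ever reaches `process_item`. As a minimal sketch of the same idea without subclassing `CsvItemExporter`, the class below relies on the `open_spider`, `close_spider` and `process_item` hooks that Scrapy calls on item pipelines, and writes with the standard `csv` module. The class name `CsvWriterPipeline`, the `path` argument and the `FIELDS` list are assumptions made for illustration; the field names are taken from `FnacItem`.

```python
import csv

# Column order is an assumption; the names come from FnacItem.
FIELDS = ["name", "nb_sales", "country", "address",
          "siret", "vat", "phone", "email"]


class CsvWriterPipeline:
    """Hypothetical pipeline that writes each item as one CSV row."""

    def __init__(self, path="Fnac.csv"):
        self.path = path
        self.file = None
        self.writer = None

    def open_spider(self, spider):
        # newline="" lets the csv module control line endings itself.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=FIELDS, restval="")
        self.writer.writeheader()

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item instances.
        self.writer.writerow(dict(item))
        return item
```

With this sketch the header row is written even when nothing is scraped, so a file that contains only the header still means that no item was yielded by the spider.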

My items:

# -*- coding: utf-8 -*-
import scrapy

# Define here the models for your scraped items

# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

class FnacItem(scrapy.Item):
    # define the fields for your items :
    # name = scrapy.Field()
    name = scrapy.Field()
    nb_sales = scrapy.Field()
    country = scrapy.Field()
    address = scrapy.Field()
    siret = scrapy.Field()
    vat = scrapy.Field()
    phone = scrapy.Field()
    email = scrapy.Field()

The command I type in the prompt to run the spider is:

scrapy crawl FnacCom

Here is a sample of the output:

2017-08-08 10:21:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Panasonic/TV-par-marque/nsh474980/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:21:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Philips/TV-par-marque/nsh474981/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:21:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Sony/TV-par-marque/nsh475001/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-LG/TV-par-marque/nsh474979/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Samsung/TV-par-marque/nsh474984/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-marque/shi474972/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-prix/shi474946/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-taille-d-ecran/shi474945/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/TV-par-Technologie/shi474944/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Smart-TV-TV-connectee/TV-par-Technologie/nsh474953/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-QLED/TV-par-Technologie/nsh474948/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-4K-UHD/TV-par-Technologie/nsh474947/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Toutes-les-TV/TV-Television/nsh474940/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:26 [scrapy.extensions.logstats] INFO: Crawled 459 pages (at 24 pages/min), scraped 0 items (at 0 items/min)
2017-08-08 10:22:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-Television/shi474914/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/partner/canalplus#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Meilleures-ventes-TV/TV-Television/nsh474942/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:37 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Toutes-nos-Offres/Offres-de-remboursement/shi159784/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Offres-Adherents/Toutes-nos-Offres/nsh81745/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/labofnac#bl=MMtvh#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Lecteur-et-Enregistreur-DVD-Blu-Ray/Lecteur-DVD-Blu-Ray/shi475063/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/TV-OLED/TV-par-Technologie/nsh474949/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:49 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Lecteur-DVD-Portable/Lecteur-et-Enregistreur-DVD-Blu-Ray/nsh475064/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Home-Cinema/Home-Cinema-par-marque/shi475116/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Univers-TV/Univers-Ecran-plat/cl179/w-4#bl=MMtvh> (referer: https://www.fnac.com)
2017-08-08 10:22:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.fnac.com/Casque-TV-HiFi/Casque-par-usage/nsh450507/w-4#bl=MMtvh> (referer: https://www.fnac.com)

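Thank you very much for your help!!!

Best answer

I wrote a small refactoring of the code to show how to write the spider explicitly, without CrawlSpider and using common Scrapy idioms: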

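from scrapy import Request, Spider
from fnac.items import FnacItem  # same item class as in the question


class Fnac(Spider):
    name = 'fnac.com'
    allowed_domains = ['fnac.com']
    start_urls = ['https://www.fnac.com/Index-Vendeurs-MarketPlace/0/']  # The first link you crawl.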

    def parse(self, response):
        # parse sellers
        sellers = response.xpath("//h1[contains(text(),'MarketPlace')]/following-sibling::ul/li/a/@href").extract()
        for url in sellers:
            yield Request(url, callback=self.parse_seller)

        # parse other pages A-Z
        pages = response.css('.pagerletter a::attr(href)').extract()
        for url in pages:
            yield Request(url, callback=self.parse)

    def parse_seller(self, response):
        nb_sales = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/span/text()').re(r'([\d]*) ventes')
        country = response.xpath('//body//table[@summary="données détaillée du vendeur"]/tbody/tr/td/text()').re(r'([A-Z].*)')
        item = FnacItem()
        # To store the data in their right places.
        item['nb_sales'] = ''.join(nb_sales).strip()
        item['country'] = ''.join(country).strip()
        # go to details page now
        details_url = response.xpath("//iframe/@src[contains(.,'retour')]").extract_first()
        yield Request(details_url, self.parse_seller_details,
                      meta={'item': item})  # carry over our item to next response

    def parse_seller_details(self, response):
        item = response.meta['item']  # get item that's got filled in `parse_seller`
        address = response.xpath('//body//div/p/text()').re(r'.*Adresse \: (.*)\n?.*')
        email = response.xpath('//body//div/ul/li[contains(text(),"@")]/text()').extract()
        # parse here
        yield item
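
The answer leaves the remaining fields behind the `# parse here` placeholder. One way to fill them in is to reuse the patterns from `extract_or_loop` in the question. The sketch below applies those same patterns with the standard `re` module to an invented text sample, as a stand-in for Scrapy's `.re()` on selected text nodes. `PATTERNS`, `extract_fields` and `SAMPLE_TEXT` are hypothetical names, and the sample values are made up rather than taken from fnac.com.

```python
import re

# Patterns copied from extract_or_loop in the question.
PATTERNS = {
    "address": r".*Adresse \: (.*)\n?.*",
    "phone": r".*Tél \: ([\d]*)\n?.*",
    "siret": r".*Siret \: ([\d]*)\n?.*",
    "vat": r".*TVA \: (.*)",
}


def extract_fields(text):
    """Return the first capture of each pattern, or '' when nothing matches."""
    fields = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[key] = match.group(1).strip() if match else ""
    return fields


# Invented sample text, used only to exercise the patterns.
SAMPLE_TEXT = (
    "Adresse : 1 rue Exemple 75000 Paris\n"
    "Tél : 0102030405\n"
    "Siret : 12345678900011"
)

fields = extract_fields(SAMPLE_TEXT)
print(fields)
```

Inside `parse_seller_details`, each value would then be assigned to the matching key of `item` before `yield item`, the same way `nb_sales` and `country` are stored in `parse_seller`.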

Regarding "python - Empty output file when crawling", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45558429/
