
python - Scraping images from XKCD with scrapy


I'm trying to crawl xkcd.com to retrieve every image they have available. When I run my scraper, it downloads 7-8 seemingly random images from the range www.xkcd.com/1-1461. I want it to work through every page in sequence and save the images, so that I end up with the complete set.

Below is the spider I wrote for the crawl, along with the output I get from scrapy:

Spider:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    rules = [Rule(LinkExtractor(allow=['\d+']), 'parse_xkcd')]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image

Output:

2014-12-18 19:57:42+1300 [scrapy] INFO: Scrapy 0.24.4 started (bot: xkcd)
2014-12-18 19:57:42+1300 [scrapy] INFO: Optional features available: ssl, http11, django
2014-12-18 19:57:42+1300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xkcd.spiders', 'SPIDER_MODULES': ['xkcd.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'xkcd'}
2014-12-18 19:57:42+1300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Spider opened
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com> (referer: None)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-nc/2.5/>
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://xkcd.com/1461/large/> (referer: http://www.xkcd.com)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Scraped from <200 http://xkcd.com/1461/large/>
{'image_urls': [], 'images': [], 'title': []}
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1/> (referer: http://www.xkcd.com)
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg> referred in <None>
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1/>
{'image_urls': [u'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'],
'images': [{'checksum': '953bf3bf4584c2e347eaaba9e4703c9d',
'path': 'full/ab31199b91c967a29443df3093fac9c97e5bbed6.jpg',
'url': 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'}],
'title': [u'Barrel - Part 1']}
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/556/> (referer: http://www.xkcd.com)
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg> referred in <None>
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/556/>
{'image_urls': [u'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'],
'images': [{'checksum': 'c88a6e5a3018bce48861bfe2a2255993',
'path': 'full/b523e12519a1735f1d2c10cb8b803e0a39bf90e5.jpg',
'url': 'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'}],
'title': [u'Alternative Energy Revolution']}
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/688/> (referer: http://www.xkcd.com)
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/self_description.png> referred in <None>
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/688/>
{'image_urls': [u'http://imgs.xkcd.com/comics/self_description.png'],
'images': [{'checksum': '230b38d12d5650283dc1cc8a7f81469b',
'path': 'full/e754ff4560918342bde8f2655ff15043e251f25a.jpg',
'url': 'http://imgs.xkcd.com/comics/self_description.png'}],
'title': [u'Self-Description']}
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/162/> (referer: http://www.xkcd.com)
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/angular_momentum.jpg> referred in <None>
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/162/>
{'image_urls': [u'http://imgs.xkcd.com/comics/angular_momentum.jpg'],
'images': [{'checksum': '83050c0cc9f4ff271a9aaf52372aeb33',
'path': 'full/7c180399f2a2cffeb321c071dea2c669d83ca328.jpg',
'url': 'http://imgs.xkcd.com/comics/angular_momentum.jpg'}],
'title': [u'Angular Momentum']}
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/730/> (referer: http://www.xkcd.com)
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/circuit_diagram.png> referred in <None>
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/730/>
{'image_urls': [u'http://imgs.xkcd.com/comics/circuit_diagram.png'],
'images': [{'checksum': 'd929f36d6981cb2825b25c9a8dac7c9e',
'path': 'full/15ad254b5cd5c506d701be67f525093af79e6ac0.jpg',
'url': 'http://imgs.xkcd.com/comics/circuit_diagram.png'}],
'title': [u'Circuit Diagram']}
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/150/> (referer: http://www.xkcd.com)
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/grownups.png> referred in <None>
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/150/>
{'image_urls': [u'http://imgs.xkcd.com/comics/grownups.png'],
'images': [{'checksum': '9d165fd0b00ec88bcc953da19d52a3d3',
'path': 'full/57fdec7b0d3b2c0a146ea77937c776994f631a4a.jpg',
'url': 'http://imgs.xkcd.com/comics/grownups.png'}],
'title': [u'Grownups']}
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1460/> (referer: http://www.xkcd.com)
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/smfw.png> referred in <None>
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1460/>
{'image_urls': [u'http://imgs.xkcd.com/comics/smfw.png'],
'images': [{'checksum': '705b029ffbdb7f2306ccb593426392fd',
'path': 'full/93805911ad95e7f5c2f93a6873a2ae36c0d00f86.jpg',
'url': 'http://imgs.xkcd.com/comics/smfw.png'}],
'title': [u'SMFW']}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Closing spider (finished)
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2173,
'downloader/request_count': 9,
'downloader/request_method_count/GET': 9,
'downloader/response_bytes': 26587,
'downloader/response_count': 9,
'downloader/response_status_count/200': 9,
'file_count': 7,
'file_status_count/uptodate': 7,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 18, 6, 57, 52, 133428),
'item_scraped_count': 8,
'log_count/DEBUG': 27,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 9,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2014, 12, 18, 6, 57, 43, 153440)}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Spider closed (finished)

Best answer

You need to set the follow parameter to True in your crawling rules. Try something like this:

linkextractor = LinkExtractor(allow=('\d+',), unique=True)
rules = [Rule(linkextractor, callback='parse_xkcd', follow=True)]
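
For reference, here is a minimal sketch of the full spider with that fix applied. It reuses the XkcdItem fields and the scrapy.contrib import paths from the question (Scrapy 0.24; newer Scrapy releases moved these under scrapy.spiders and scrapy.linkextractors):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    # When a Rule is given a callback, follow defaults to False, so the
    # original spider only visited the links found on the front page.
    # With follow=True, links are also extracted from every comic page
    # the spider visits, so it chains through the Prev/Next links until
    # it has covered the whole archive.
    rules = [Rule(LinkExtractor(allow=('\d+',), unique=True),
                  callback='parse_xkcd', follow=True)]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image

This also explains the 7-8 "random" images in the log: they are simply the comics linked directly from the front page, which is as far as the spider got with follow left at its default.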

About python - Scraping images from XKCD with scrapy: we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27542984/
