I'm trying to crawl xkcd.com to retrieve every image they have available. When I run my scraper, it downloads only 7-8 seemingly random images from the range www.xkcd.com/1-1461. I want it to walk through every page in sequence and save each image so that I end up with the complete set.
Below is the spider I wrote for the crawl, along with the output I get from Scrapy:
Spider:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from xkcd.items import XkcdItem

class XkcdimagesSpider(CrawlSpider):
    name = "xkcdimages"
    allowed_domains = ["xkcd.com"]
    start_urls = ['http://www.xkcd.com']
    rules = [Rule(LinkExtractor(allow=['\d+']), 'parse_xkcd')]

    def parse_xkcd(self, response):
        image = XkcdItem()
        image['title'] = response.xpath(
            "//div[@id='ctitle']/text()").extract()
        image['image_urls'] = response.xpath(
            "//div[@id='comic']/img/@src").extract()
        return image
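The two XPath expressions can be sanity-checked offline with the standard library alone. The snippet below is a hand-written stand-in for the relevant part of an xkcd comic page, not markup fetched from the site, so treat it as an assumption about the page structure:

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the relevant part of an xkcd comic page.
snippet = """
<body>
  <div id="ctitle">Barrel - Part 1</div>
  <div id="comic">
    <img src="http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg"/>
  </div>
</body>
"""

root = ET.fromstring(snippet)

# Equivalent of //div[@id='ctitle']/text()
title = root.find(".//div[@id='ctitle']").text

# Equivalent of //div[@id='comic']/img/@src
src = root.find(".//div[@id='comic']/img").get("src")

print(title)  # Barrel - Part 1
print(src)    # http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg
```

So the selectors themselves are fine; the problem is which pages the spider visits, not how it parses them.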
Output:
2014-12-18 19:57:42+1300 [scrapy] INFO: Scrapy 0.24.4 started (bot: xkcd)
2014-12-18 19:57:42+1300 [scrapy] INFO: Optional features available: ssl, http11, django
2014-12-18 19:57:42+1300 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'xkcd.spiders', 'SPIDER_MODULES': ['xkcd.spiders'], 'DOWNLOAD_DELAY': 1, 'BOT_NAME': 'xkcd'}
2014-12-18 19:57:42+1300 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-12-18 19:57:43+1300 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Spider opened
2014-12-18 19:57:43+1300 [xkcdimages] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-12-18 19:57:43+1300 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com> (referer: None)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Filtered offsite request to 'creativecommons.org': <GET http://creativecommons.org/licenses/by-nc/2.5/>
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://xkcd.com/1461/large/> (referer: http://www.xkcd.com)
2014-12-18 19:57:43+1300 [xkcdimages] DEBUG: Scraped from <200 http://xkcd.com/1461/large/>
{'image_urls': [], 'images': [], 'title': []}
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1/> (referer: http://www.xkcd.com)
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg> referred in <None>
2014-12-18 19:57:45+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1/>
{'image_urls': [u'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'],
'images': [{'checksum': '953bf3bf4584c2e347eaaba9e4703c9d',
'path': 'full/ab31199b91c967a29443df3093fac9c97e5bbed6.jpg',
'url': 'http://imgs.xkcd.com/comics/barrel_cropped_(1).jpg'}],
'title': [u'Barrel - Part 1']}
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/556/> (referer: http://www.xkcd.com)
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg> referred in <None>
2014-12-18 19:57:46+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/556/>
{'image_urls': [u'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'],
'images': [{'checksum': 'c88a6e5a3018bce48861bfe2a2255993',
'path': 'full/b523e12519a1735f1d2c10cb8b803e0a39bf90e5.jpg',
'url': 'http://imgs.xkcd.com/comics/alternative_energy_revolution.jpg'}],
'title': [u'Alternative Energy Revolution']}
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/688/> (referer: http://www.xkcd.com)
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/self_description.png> referred in <None>
2014-12-18 19:57:47+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/688/>
{'image_urls': [u'http://imgs.xkcd.com/comics/self_description.png'],
'images': [{'checksum': '230b38d12d5650283dc1cc8a7f81469b',
'path': 'full/e754ff4560918342bde8f2655ff15043e251f25a.jpg',
'url': 'http://imgs.xkcd.com/comics/self_description.png'}],
'title': [u'Self-Description']}
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/162/> (referer: http://www.xkcd.com)
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/angular_momentum.jpg> referred in <None>
2014-12-18 19:57:48+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/162/>
{'image_urls': [u'http://imgs.xkcd.com/comics/angular_momentum.jpg'],
'images': [{'checksum': '83050c0cc9f4ff271a9aaf52372aeb33',
'path': 'full/7c180399f2a2cffeb321c071dea2c669d83ca328.jpg',
'url': 'http://imgs.xkcd.com/comics/angular_momentum.jpg'}],
'title': [u'Angular Momentum']}
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/730/> (referer: http://www.xkcd.com)
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/circuit_diagram.png> referred in <None>
2014-12-18 19:57:49+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/730/>
{'image_urls': [u'http://imgs.xkcd.com/comics/circuit_diagram.png'],
'images': [{'checksum': 'd929f36d6981cb2825b25c9a8dac7c9e',
'path': 'full/15ad254b5cd5c506d701be67f525093af79e6ac0.jpg',
'url': 'http://imgs.xkcd.com/comics/circuit_diagram.png'}],
'title': [u'Circuit Diagram']}
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/150/> (referer: http://www.xkcd.com)
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/grownups.png> referred in <None>
2014-12-18 19:57:50+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/150/>
{'image_urls': [u'http://imgs.xkcd.com/comics/grownups.png'],
'images': [{'checksum': '9d165fd0b00ec88bcc953da19d52a3d3',
'path': 'full/57fdec7b0d3b2c0a146ea77937c776994f631a4a.jpg',
'url': 'http://imgs.xkcd.com/comics/grownups.png'}],
'title': [u'Grownups']}
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Crawled (200) <GET http://www.xkcd.com/1460/> (referer: http://www.xkcd.com)
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: File (uptodate): Downloaded image from <GET http://imgs.xkcd.com/comics/smfw.png> referred in <None>
2014-12-18 19:57:52+1300 [xkcdimages] DEBUG: Scraped from <200 http://www.xkcd.com/1460/>
{'image_urls': [u'http://imgs.xkcd.com/comics/smfw.png'],
'images': [{'checksum': '705b029ffbdb7f2306ccb593426392fd',
'path': 'full/93805911ad95e7f5c2f93a6873a2ae36c0d00f86.jpg',
'url': 'http://imgs.xkcd.com/comics/smfw.png'}],
'title': [u'SMFW']}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Closing spider (finished)
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2173,
'downloader/request_count': 9,
'downloader/request_method_count/GET': 9,
'downloader/response_bytes': 26587,
'downloader/response_count': 9,
'downloader/response_status_count/200': 9,
'file_count': 7,
'file_status_count/uptodate': 7,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 12, 18, 6, 57, 52, 133428),
'item_scraped_count': 8,
'log_count/DEBUG': 27,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 9,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2014, 12, 18, 6, 57, 43, 153440)}
2014-12-18 19:57:52+1300 [xkcdimages] INFO: Spider closed (finished)
Best answer
You need to set the follow parameter to True in your crawling rules. Without it, the CrawlSpider only follows links found on the start page and stops there, which is why your log shows request_depth_max: 1 and only the handful of comics linked from the front page. Try something like this:
linkextractor = LinkExtractor(allow=(r'\d+',), unique=True)
rules = [Rule(linkextractor, callback='parse_xkcd', follow=True)]
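The allow pattern is a regular expression matched against candidate URLs, which a quick standard-library check (no Scrapy required) makes easy to verify. Note that r'\d+' also matches stray numeric links such as the /1461/large/ page seen in the log, since any digit anywhere in the URL qualifies:

```python
import re

# The same pattern passed to LinkExtractor(allow=...)
pattern = re.compile(r'\d+')

urls = [
    "http://www.xkcd.com/1/",        # comic page: matches
    "http://www.xkcd.com/1460/",     # comic page: matches
    "http://xkcd.com/1461/large/",   # also matches; the log shows it was crawled
    "http://www.xkcd.com/about/",    # no digits, so the rule skips it
]

for url in urls:
    print(url, bool(pattern.search(url)))
```

With follow=True, every matching link on every crawled page is queued in turn, so the spider reaches each comic through the Prev/Next navigation links instead of stopping at the front page.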
About "python - scraping images from xkcd with scrapy", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27542984/