
python - Scrapy tutorial example


I'm hoping someone can point me in the right direction for using Scrapy in Python.

I've been trying to follow the example for a few days now and still can't get the expected output. I followed the Scrapy tutorial, http://doc.scrapy.org/en/latest/intro/tutorial.html#defining-our-item , and even downloaded the exact project from the GitHub repository, but the output I get is not what the tutorial describes.

from scrapy.spiders import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items

After downloading the project from GitHub, I run "scrapy crawl dmoz" in the top-level directory and get the following output:

2016-08-31 00:08:19 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
2016-08-31 00:08:19 [scrapy] INFO: Overridden settings: {'DEFAULT_ITEM_CLASS': 'dirbot.items.Website', 'NEWSPIDER_MODULE': 'dirbot.spiders', 'SPIDER_MODULES': ['dirbot.spiders']}
2016-08-31 00:08:19 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-08-31 00:08:19 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-08-31 00:08:19 [scrapy] INFO: Enabled item pipelines:
['dirbot.pipelines.FilterWordsPipeline']
2016-08-31 00:08:19 [scrapy] INFO: Spider opened
2016-08-31 00:08:19 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-08-31 00:08:19 [scrapy] DEBUG: Telnet console listening on 128.1.2.1:2700
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2016-08-31 00:08:20 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2016-08-31 00:08:20 [scrapy] INFO: Closing spider (finished)
2016-08-31 00:08:20 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16179,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 8, 31, 7, 8, 20, 314625),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 8, 31, 7, 8, 19, 882944)}
2016-08-31 00:08:20 [scrapy] INFO: Spider closed (finished)

Following the tutorial, I was expecting output like this:

[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n'],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
[scrapy] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u' - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n'],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}

Best answer

The spider in the tutorial seems to be outdated. The website has changed, so none of the XPaths match anything anymore. This is easy to fix:

def parse(self, response):
    sites = response.xpath('//div[@class="title-and-desc"]/a')
    for site in sites:
        item = dict()
        item['name'] = site.xpath("text()").extract_first()
        item['url'] = site.xpath("@href").extract_first()
        item['description'] = site.xpath("following-sibling::div/text()").extract_first('').strip()
        yield item
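
To confirm that the rewritten parse method actually produces items, you can run the spider and write whatever it yields to a feed file with Scrapy's built-in -o option. The command below is a minimal sketch that assumes the spider is still registered under the name "dmoz" and that you run it from the project's top-level directory:

$ scrapy crawl dmoz -o items.json
# items.json should now contain one entry per listing, each with name, url and description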

For future reference, you can always test whether a particular XPath works using the scrapy shell command.
For example, this is what I did to test it:

$ scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
# test sites xpath
response.xpath('//ul[@class="directory-url"]/li')
[]
# ok it doesn't work, check out page in web browser
view(response)
# find correct xpath and test that:
response.xpath('//div[@class="title-and-desc"]/a')
# 21 result nodes printed
# it works!
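
If you would rather keep using the Website item from dirbot.items instead of yielding plain dicts, the same XPaths can be dropped back into the original spider. The sketch below assumes Website still declares the name, url and description fields shown in the question:

from scrapy.spiders import Spider

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # each listing is now an <a> inside a <div class="title-and-desc">
        for site in response.xpath('//div[@class="title-and-desc"]/a'):
            item = Website()  # assumes the item defines name, url and description
            item['name'] = site.xpath('text()').extract_first()
            item['url'] = site.xpath('@href').extract_first()
            # the description sits in the sibling <div> that follows the link
            item['description'] = site.xpath(
                'following-sibling::div/text()').extract_first('').strip()
            yield item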

Regarding "python - Scrapy tutorial example", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39243009/
