python - Scrapy Spiders - Handling non-HTML links (PDF, PPT, etc.)


I am learning Scrapy and Python, starting from a blank project. I am using Scrapy's LxmlLinkExtractor to parse links, but the spider always gets stuck when it encounters non-HTML links/pages such as PDFs or other documents.

Question: generally speaking, how do I handle such links with Scrapy if I only want to store their URLs (I don't need the documents' contents for now...)?

Example page containing documents: http://afcorfmc.org/2009.html

Here is my spider code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from super.items import SuperItem
from scrapy.selector import Selector

class mySuper(CrawlSpider):
    name = "super"
    # only allow crawling of the site listed in allowed_domains
    allowed_domains = ['afcorfmc.org']

    # start from the site's home page
    start_urls = ['http://afcorfmc.org']

    rules = (Rule(LxmlLinkExtractor(allow=(), deny=(), restrict_xpaths=()), callback="parse_o", follow=True),)

    def parse_o(self, response):
        # collect the harvested data (the page content)
        sel = Selector(response)

        # prepare the item we are about to fill (defined in items.py)
        item = SuperItem()

        # store the page URL in the item
        item['url'] = response.url

        # grab the page title via an XPath expression
        #item['titre'] = sel.xpath('//title/text()').extract()

        # pass the item on to the rest of the pipeline
        yield item

Best Answer

As explained in the scrapy LinkExtractor docs, LxmlLinkExtractor excludes links with certain file extensions by default: see https://github.com/scrapy/scrapy/blob/master/scrapy/linkextractors/__init__.py#L20

This list of extensions includes .pdf and .ppt.
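
If you want to inspect the full default list yourself, recent Scrapy releases expose it as IGNORED_EXTENSIONS; this is a quick sketch, with the caveat that the module path has moved between versions (older releases kept the list in scrapy.linkextractor):

# hedged sketch: print Scrapy's default list of ignored link extensions
# (import path valid in recent Scrapy versions; older releases used
# scrapy.linkextractor instead)
from scrapy.linkextractors import IGNORED_EXTENSIONS
print(IGNORED_EXTENSIONS)  # contains 'pdf' and 'ppt', among many others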

You can add a deny_extensions parameter to your LxmlLinkExtractor instance and leave it empty, like this:

$ scrapy shell http://afcorfmc.org/2009.html
2014-10-27 10:27:02+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
...
2014-10-27 10:27:03+0100 [default] DEBUG: Crawled (200) <GET http://afcorfmc.org/2009.html> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f5b1a6f4910>
[s] item {}
[s] request <GET http://afcorfmc.org/2009.html>
[s] response <200 http://afcorfmc.org/2009.html>
[s] settings <scrapy.settings.Settings object at 0x7f5b2013f450>
[s] spider <Spider 'default' at 0x7f5b19e9bed0>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser

In [1]: from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor

In [2]: lx = LxmlLinkExtractor(allow=(),deny=(),restrict_xpaths=(), deny_extensions=())

In [3]: lx.extract_links(response)
Out[3]:
[Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/ANATOMO_PATHOLOGIE_Dr_Guinebretiere.ppt', text='ANATOMO_PATHOLOGIE_Dr_Guinebretiere.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIMIOTHERAPIE_Dr_Toledano.ppt', text='CHIMIOTHERAPIE_Dr_Toledano.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIRURGIE_Dr_Guglielmina.ppt', text='CHIRURGIE_Dr_Guglielmina.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/CHIRURGIE_Dr_Sebban.ppt', text='CHIRURGIE_Dr_Sebban.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/Cas_clinique_oesophage.ppt', text='Cas_clinique_oesophage.ppt', fragment='', nofollow=False),
Link(url='http://afcorfmc.org/documents/TOPOS/2009/MARS/IMAGERIE_Dr_Seror.ppt', text='IMAGERIE_Dr_Seror.ppt', fragment='', nofollow=False),
...
Link(url='http://afcorfmc.org/documents/TOPOS/2009/OCTOBRE/VB4_Technique%20monoisocentrique%20dans%20le%20sein%20Vero%20Avignon%202009.pdf', text='VB4_Technique monoisocentrique dans le sein Vero Avignon 2009.pdf', fragment='', nofollow=False)]

In [4]:
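To tie this back to the spider: since the goal is only to store the document URLs without downloading the files, one option (a minimal sketch, not part of the original answer) is to keep the default extractor in the crawl rule, so the spider only follows HTML pages, and run a second extractor with deny_extensions=() inside the callback to harvest the document links from each page:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# sketch only: stores .pdf/.ppt URLs as items without ever requesting them
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.lxmlhtml import LxmlLinkExtractor
from super.items import SuperItem

class mySuper(CrawlSpider):
    name = "super"
    allowed_domains = ['afcorfmc.org']
    start_urls = ['http://afcorfmc.org']

    # default extractor: documents stay excluded, so only HTML pages
    # are crawled and the spider never downloads a PDF or PPT
    rules = (Rule(LxmlLinkExtractor(), callback="parse_o", follow=True),)

    # second extractor with deny_extensions=(): also returns document
    # links, used for harvesting URLs only, never for crawling
    doc_extractor = LxmlLinkExtractor(deny_extensions=())

    def parse_o(self, response):
        # yield one item per link on the page, documents included
        # (deduplication across pages is left out of this sketch)
        for link in self.doc_extractor.extract_links(response):
            item = SuperItem()
            item['url'] = link.url
            yield item

This way the .pdf and .ppt URLs end up in your items, but Scrapy never issues requests for them, so the spider cannot get stuck on binary downloads.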

Regarding python - Scrapy Spiders - Handling non-HTML links (PDF, PPT, etc.), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26583611/
