
mime - How do I skip certain file types when crawling with Scrapy?


I want to skip links to certain file types (.exe, .zip, .pdf) when crawling with Scrapy, but without using a Rule with specific URL patterns. How?

Update:

It turned out to be hard to decide whether to follow a link based only on the Content-Type of the response, since at that point the body has not been downloaded yet. I changed my approach to dropping the URLs in a downloader middleware instead. Thanks Peter and Leo.

Best Answer

If you go to linkextractor.py within the Scrapy root directory, you will see the following:

"""
Common code and definitions used by Link extractors (located in
scrapy.contrib.linkextractor).
"""

# common file extensions that are not followed if they occur in links
IGNORED_EXTENSIONS = [
# images
'mng', 'pct', 'bmp', 'gif', 'jpg', 'jpeg', 'png', 'pst', 'psp', 'tif',
'tiff', 'ai', 'drw', 'dxf', 'eps', 'ps', 'svg',

# audio
'mp3', 'wma', 'ogg', 'wav', 'ra', 'aac', 'mid', 'au', 'aiff',

# video
'3gp', 'asf', 'asx', 'avi', 'mov', 'mp4', 'mpg', 'qt', 'rm', 'swf', 'wmv',
'm4a',

# other
'css', 'pdf', 'doc', 'exe', 'bin', 'rss', 'zip', 'rar',
]

However, since this applies to link extractors (and you don't want to use rules), I'm not sure this will solve your problem. (I just realized you specified that you don't want to use rules; I had thought you were asking how to change the file-extension restrictions without specifying them directly in a rule.)
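For completeness, here is a minimal sketch of how that extension filter can be tuned on a link extractor via the deny_extensions argument (available in recent Scrapy versions; the extra extensions here are just examples). This only applies if you do use rule-based extraction, which the question rules out:

from scrapy.linkextractors import IGNORED_EXTENSIONS, LinkExtractor

# Block the default extensions plus a couple of extras.
# Entries are given without the leading dot, like IGNORED_EXTENSIONS itself.
extractor = LinkExtractor(deny_extensions=IGNORED_EXTENSIONS + ['torrent', 'iso'])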

The good news is that you can also build your own downloader middleware and drop any/all requests to URLs that have an undesirable extension. See Downloader Middleware.

You can get a request's URL through the url attribute of the request object, i.e. request.url.
Essentially, check the end of the string for '.exe' or whatever extension you want to drop, and if it ends with one of them, raise an IgnoreRequest exception; the request will be dropped immediately.

Update

To process requests before they are downloaded, you need to make sure your custom downloader middleware defines the process_request method.

From the Scrapy documentation on process_request:

This method is called for each request that goes through the download middleware.

process_request() should return either None, a Response object, or a Request object.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called, the request performed (and its response downloaded).

If it returns a Response object, Scrapy won’t bother calling ANY other request or exception middleware, or the appropriate download function; it’ll return that Response. Response middleware is always called on every Response.

If it returns a Request object, the returned request will be rescheduled (in the Scheduler) to be downloaded in the future. The callback of the original request will always be called. If the new request has a callback it will be called with the response downloaded, and the output of that callback will then be passed to the original callback. If the new request doesn’t have a callback, the response downloaded will be just passed to the original request callback.

If it raises an IgnoreRequest exception, the entire request will be dropped completely and its callback never called.



So essentially, just create a downloader middleware class with a process_request method that takes a request object and a spider object as parameters, and raise an IgnoreRequest exception if the URL contains an unwanted extension; a minimal sketch follows.
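Here is a minimal sketch of such a middleware (the class name, extension list, and settings entry are illustrative, not part of the original answer):

from scrapy.exceptions import IgnoreRequest


class IgnoreExtensionsMiddleware:
    # Drop requests whose URL ends with an unwanted file extension.
    BAD_EXTENSIONS = ('.exe', '.zip', '.pdf')  # extend as needed

    def process_request(self, request, spider):
        # Strip any query string before checking the extension.
        path = request.url.split('?', 1)[0].lower()
        if path.endswith(self.BAD_EXTENSIONS):
            raise IgnoreRequest('skipping %s: unwanted extension' % request.url)
        return None  # let the request continue through the middleware chain

To enable it, register the class in your project settings (the module path and priority here are examples):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.IgnoreExtensionsMiddleware': 543,
}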

All of this happens before the page is downloaded. If, however, you want to act on the response headers instead, the request to the web page will already have been made.

You can always implement both a process_request and a process_response method in the same downloader middleware. The idea is that obvious extensions are dropped immediately in process_request, and then, if for some reason the URL carries no file extension, the request goes through and the response is caught in process_response, where you can check the Content-Type header; see the sketch below.
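Extending the hypothetical IgnoreExtensionsMiddleware above with such a process_response method might look like this (the Content-Type values are assumptions; adjust them to the MIME types you want to skip):

    BAD_CONTENT_TYPES = (b'application/zip', b'application/pdf',
                         b'application/x-msdownload')

    def process_response(self, request, response, spider):
        # The body has already been downloaded at this point, but raising
        # IgnoreRequest still keeps the response away from the spider.
        content_type = response.headers.get('Content-Type', b'')
        if content_type.startswith(self.BAD_CONTENT_TYPES):
            raise IgnoreRequest('skipping %s: unwanted Content-Type' % request.url)
        return response

Note that by the time process_response runs the bandwidth has already been spent; the extension check in process_request is what actually avoids the download.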

Regarding "mime - How do I skip certain file types when crawling with Scrapy?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/12140460/
