
python - Using Scrapy, how do I check which links on a single page are allowed by the robots.txt file?


Using Scrapy (run from a script, not from the console), I crawl a page and want to check whether all of the links on that page are allowed by the site's robots.txt file.

In the abstract base class scrapy.robotstxt.RobotParser I found the method allowed(url, user_agent), but I don't see how to use it.

import scrapy
from scrapy.linkextractors import LinkExtractor

class TestSpider(scrapy.Spider):
    name = "TestSpider"

    def __init__(self):
        super(TestSpider, self).__init__()

    def start_requests(self):
        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = LinkExtractor().extract_links(response)
            for idx, link in enumerate(links):
                # How can I check whether each link is allowed by the robots.txt file?
                # => allowed(link.url, '*')

                # self.crawler.engine.downloader.middleware.middlewares
                # self.crawler AttributeError: 'TestSpider' object has no attribute 'crawler'
                pass

To run the TestSpider spider, set the following in settings.py:

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

then change to the project's top-level directory and run:

scrapy crawl TestSpider

Any help is appreciated.

My solution:

import scrapy
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.utils.httpobj import urlparse_cached
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class TestSpider(CrawlSpider):
    name = "TestSpider"
    # Required for allow_domains below (derived from the start URL).
    allowed_domains = ['httpbin.org']

    def __init__(self):
        super(TestSpider, self).__init__()
        self.le = LinkExtractor(unique=True, allow_domains=self.allowed_domains)
        self._rules = [
            Rule(self.le, callback=self.parse)
        ]

    def start_requests(self):
        # Locate the robots.txt downloader middleware (enabled by ROBOTSTXT_OBEY = True).
        self._robotstxt_middleware = None
        for middleware in self.crawler.engine.downloader.middleware.middlewares:
            if isinstance(middleware, RobotsTxtMiddleware):
                self._robotstxt_middleware = middleware
                break

        yield scrapy.Request(url='http://httpbin.org/', callback=self.parse_robotstxt)

    def parse_robotstxt(self, response):
        robotstxt_middleware = None
        for middleware in self.crawler.engine.downloader.middleware.middlewares:
            if isinstance(middleware, RobotsTxtMiddleware):
                robotstxt_middleware = middleware
                break

        # Grab the parser the middleware has already built for this host.
        netloc = urlparse_cached(response).netloc
        self._robotsTxtParser = None
        if robotstxt_middleware and netloc in robotstxt_middleware._parsers:
            self._robotsTxtParser = robotstxt_middleware._parsers[netloc]

        return self.parse(response)

    def parse(self, response):
        if 200 <= response.status < 300:
            links = self.le.extract_links(response)
            for idx, link in enumerate(links):
                # Check whether the link target is forbidden by robots.txt.
                if self._robotsTxtParser:
                    if not self._robotsTxtParser.allowed(link.url, "*"):
                        print(link.url, 'is disallowed by robots.txt')

Best answer

The available parser implementations are listed on that same page, slightly above the section you linked to.

Protego parser

Based on Protego:

  • implemented in Python
  • is compliant with Google’s Robots.txt Specification
  • supports wildcard matching
  • uses the length based rule

Scrapy uses this parser by default.
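
In Scrapy's settings this default corresponds to the ROBOTSTXT_PARSER setting; a minimal settings.py sketch (assuming a recent Scrapy 2.x release) that spells it out explicitly:

# settings.py -- being explicit about the defaults discussed above
ROBOTSTXT_OBEY = True
# Protego-backed parser; this is already the default value in Scrapy 2.x
ROBOTSTXT_PARSER = 'scrapy.robotstxt.ProtegoRobotParser'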

So if you want the same results that Scrapy gives you by default, use Protego.

Usage is as follows (robotstxt is the content of a robots.txt file):

>>> from protego import Protego
>>> rp = Protego.parse(robotstxt)
>>> rp.can_fetch("http://example.com/profiles", "mybot")
False
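
To tie this back to the spider in the question, here is a minimal sketch (the spider name, callback names, and the httpbin URLs are illustrative, taken from the question) that downloads robots.txt first, builds a Protego parser from it, and then filters the links extracted from the page:

import scrapy
from protego import Protego
from scrapy.linkextractors import LinkExtractor

class ProtegoTestSpider(scrapy.Spider):
    name = "ProtegoTestSpider"

    def start_requests(self):
        # Fetch robots.txt first, then the page itself.
        yield scrapy.Request('http://httpbin.org/robots.txt', callback=self.parse_robots)

    def parse_robots(self, response):
        # Build a Protego parser from the robots.txt body.
        rp = Protego.parse(response.text)
        yield scrapy.Request('http://httpbin.org/', callback=self.parse, cb_kwargs={'rp': rp})

    def parse(self, response, rp):
        if 200 <= response.status < 300:
            for link in LinkExtractor(unique=True).extract_links(response):
                if not rp.can_fetch(link.url, '*'):
                    print(link.url, 'is disallowed by robots.txt')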

It is also possible to identify and reuse the robots middleware that is currently in use, but for most use cases that is probably more trouble than it is worth.

Edit:

If you really want to reuse the middleware, your spider can access the downloader middlewares through self.crawler.engine.downloader.middleware.middlewares.
From there, you need to identify the robots middleware (by class, perhaps?) and the parser you need (from the middleware's _parsers attribute).
Finally, use that parser's allowed() method (or can_fetch() on the underlying Protego object) to check your links.
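
For illustration, a minimal sketch of that approach (the spider name and URL are placeholders; it assumes ROBOTSTXT_OBEY is enabled so that the middleware has already fetched and cached robots.txt for the host by the time the page response arrives):

import scrapy
from scrapy.downloadermiddlewares.robotstxt import RobotsTxtMiddleware
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.httpobj import urlparse_cached

class MiddlewareReuseSpider(scrapy.Spider):
    name = "MiddlewareReuseSpider"

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/', callback=self.parse)

    def parse(self, response):
        # Identify the robots middleware by class.
        mw = next((m for m in self.crawler.engine.downloader.middleware.middlewares
                   if isinstance(m, RobotsTxtMiddleware)), None)
        if mw is None:
            return
        # Look up the parser for this response's host in the middleware's cache.
        parser = mw._parsers.get(urlparse_cached(response).netloc)
        if parser is None:
            return
        for link in LinkExtractor(unique=True).extract_links(response):
            if not parser.allowed(link.url, '*'):
                print(link.url, 'is disallowed by robots.txt')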

Regarding "python - Using Scrapy, how do I check which links on a single page are allowed by the robots.txt file?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/64495540/
