python - Scrapy and Cloudflare

Reposted by: 行者123 | Updated: 2023-12-05 07:32:25

I am trying to scrape a URL on a site that sits behind Cloudflare using Scrapy, but I cannot get any results:

2018-07-09 22:14:00 [scrapy.core.engine] INFO: Spider opened
2018-07-09 22:14:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-09 22:14:00 [scrapy.extensions.httpcache] DEBUG: Using filesystem cache storage in C:\Users\Luis\Mister\.scrapy\httpcache
2018-07-09 22:14:00 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-09 22:14:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mister-auto.es/robots.txt> (referer: None) ['cached']
2018-07-09 22:14:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.mister-auto.es/global_search2.html? idx=prod_monoindex_ESes&q=FEBI+BILSTEIN> (referer: None) ['cached']
2018-07-09 22:14:00 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-09 22:14:00 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 633,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 20858,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 7, 9, 20, 14, 0, 833000),
 'httpcache/hit': 2,
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2018, 7, 9, 20, 14, 0, 594000)}
2018-07-09 22:14:00 [scrapy.core.engine] INFO: Spider closed (finished)

Because the site is protected by Cloudflare, I installed this: https://github.com/clemfromspace/scrapy-cloudflare-middleware

After I modified my settings.py, I got the following error:

Traceback (most recent call last):
  File "C:\Users\Luis\Anaconda2\lib\site-packages\twisted\internet\defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 98, in crawl
    six.reraise(*exc_info)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 80, in crawl
    self.engine = self._create_engine()
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\crawler.py", line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\core\engine.py", line 69, in __init__
    self.downloader = downloader_cls(crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\core\downloader\__init__.py", line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\middleware.py", line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\middleware.py", line 34, in from_settings
    mwcls = load_object(clspath)
  File "C:\Users\Luis\Anaconda2\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "C:\Users\Luis\Anaconda2\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named scraping_hub.middlewares

At this point I am stuck. I do not know whether I need to change settings.py or middlewares.py.

Can you help me? I want to improve my skills. ;)

P.S. I have added my middlewares.py:

from scrapy import signals


class MercadoSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class MercadoDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
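
The traceback in the question ends inside Scrapy's load_object, which imports the module part of each dotted path listed in the middleware settings. The ImportError therefore suggests that a path in settings.py (scraping_hub.middlewares) does not match an importable module. A minimal sketch for checking such paths outside Scrapy follows. check_dotted_path is a hypothetical helper, and the class name in the first entry is an assumption, since the original settings.py is not shown.

```python
from importlib import import_module


def check_dotted_path(path):
    """Return None if 'pkg.module.Name' resolves, else a short error message.

    Hypothetical helper: it mirrors the import-then-getattr steps visible in
    the traceback, but it is not part of Scrapy. Assumes `path` contains a dot.
    """
    module_name, _, attr = path.rpartition('.')
    try:
        module = import_module(module_name)
    except ImportError as exc:
        return 'cannot import module %r: %s' % (module_name, exc)
    if not hasattr(module, attr):
        return 'module %r has no attribute %r' % (module_name, attr)
    return None


# Assumed setting values, for illustration only.
DOWNLOADER_MIDDLEWARES = {
    # Module path taken from the ImportError; the class name is a guess.
    'scraping_hub.middlewares.MercadoDownloaderMiddleware': 543,
    # Stand-in for a path that does resolve (standard library).
    'json.decoder.JSONDecoder': 560,
}

for path in sorted(DOWNLOADER_MIDDLEWARES):
    problem = check_dotted_path(path)
    print('%s -> %s' % (path, problem or 'OK'))
```

Run from the project root (the directory that contains scrapy.cfg), an entry that reports "cannot import module" names a package or module that Python cannot find from there, which usually means the first component of the path does not match the project's package directory.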

Best answer

Use scrapy-rotating-proxies to get around it:

pip install scrapy-rotating-proxies
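
Installing the package is only the first step; it also has to be enabled in settings.py. The fragment below is a sketch based on the setting names and middleware paths given in the scrapy-rotating-proxies README. The proxy addresses are placeholders, not working proxies, and the answer does not say which proxies were used.

```python
# settings.py (fragment)

# Placeholder proxy addresses, for illustration only; replace with real ones.
ROTATING_PROXY_LIST = [
    'proxy1.example.com:8000',
    'proxy2.example.com:8031',
]

# Middleware paths and priorities as listed in the scrapy-rotating-proxies README.
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
```

If the project already defines DOWNLOADER_MIDDLEWARES, these two entries would be merged into the existing dictionary rather than replacing it.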

Source: a similar question, python - Scrapy and Cloudflare, was found on Stack Overflow: https://stackoverflow.com/questions/51253538/
