
python - Twisted Python Failure - Scrapy Issues


I am trying to use Scrapy to scrape the search results for any search query on this site - http://www.bewakoof.com .

The site uses AJAX (in the form of XHR) to display the search results. I managed to trace the XHR, and you can see it in my code below (inside the for loop, where I store the URL into temp and increment 'i' in the loop):

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp=( "http://www.bewakoof.com/search/searchload/search_text/" + query + "/page_num/" + str(i) )
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

Now, when I execute this, I get an unexpected error:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
'downloader/request_bytes': 780,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)

If you look at my code carefully, I have also set DOWNLOAD_DELAY=5, yet it gives the same error as when I hadn't set it at all. I also increased DOWNLOAD_DELAY to 10, and it still gives the same error. I have read many questions related to this on Stack Overflow and GitHub, but none of them seem to help.

I read in one of the answers that TOR with Polipo can help. However, I am a little doubtful about using it, because I don't know whether it is legal to use the combination of TOR and Polipo to scrape websites with Scrapy. (I don't want to run into any legal trouble.) That is why I would prefer not to use it. So, if it is legal, please provide code for my specific case using TOR and Polipo.

Or rather, if it is illegal, please help me solve this without using them.

Please help me resolve these errors!

EDIT:

Here is my updated code:

from twisted.internet import reactor
from scrapy.crawler import CrawlerProcess, CrawlerRunner
import scrapy
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from scrapy.settings import Settings
import datetime
from multiprocessing import Process, Queue
import os
from scrapy.http import Request
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.signalmanager import SignalManager
import re

query='shirt'
query1=query.replace(" ", "+")


class DmozItem(scrapy.Item):

    productname = scrapy.Field()
    product_link = scrapy.Field()
    current_price = scrapy.Field()
    mrp = scrapy.Field()
    offer = scrapy.Field()
    imageurl = scrapy.Field()
    outofstock_status = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["http://www.bewakoof.com"]

    def _monkey_patching_HTTPClientParser_statusReceived(self):

        from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
        old_sr = HTTPClientParser.statusReceived
        def statusReceived(self, status):
            try:
                return old_sr(self, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    return old_sr(self, status + ' OK')
                raise
        statusReceived.__doc__ == old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived

    def start_requests(self):

        task_urls = [
        ]
        i=1
        for i in range(1,2):
            temp = "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1"
            task_urls.append(temp)
            i=i+1

        start_urls = (task_urls)
        p=len(task_urls)
        print 'hi'
        self._monkey_patching_HTTPClientParser_statusReceived()
        return [ Request(url = start_url) for start_url in start_urls ]
        print 'hi'

    def parse(self, response):
        print 'hi'
        print response
        items = []
        for sel in response.xpath('//html/body/div[@class="main-div-of-product-item"]'):
            item = DmozItem()
            item['productname'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@title').extract())[17:-6]
            item['product_link'] = "http://www.bewakoof.com"+str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@href').extract())[3:-2]
            item['current_price']='Rs. ' + str(sel.xpath('div[1]/div[@class="product_info"]/div[@class="product_price_nomrp"]/span[1]/text()').extract())[3:-2]

            item['mrp'] = item['current_price']

            item['offer'] = str('No additional offer available')

            item['imageurl'] = str(sel.xpath('div[1]/span[@class="lazyImage"]/span[1]/a/img[@id="main_image"]/@data-original').extract())[3:-2]
            item['outofstock_status'] = str('In Stock')
            items.append(item)

        print (items)


spider1 = DmozSpider()
settings = Settings()
settings.set("PROJECT", "dmoz")
settings.set("DOWNLOAD_DELAY" , 5)
crawler = CrawlerProcess(settings)
crawler.crawl(spider1)
crawler.start()

And here is my updated output, as shown on the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines:
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
'downloader/request_bytes': 780,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
'log_count/DEBUG': 4,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)

So, as you can see, the error is still the same! :( So, please help me resolve this!

UPDATE:

This is the output I get when I try/except the exception, as @JoeLinux suggested:

>>> try:
... fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
... e
...
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')

Best Answer

I ran into the same error:

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]

And I have it working now.
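The cause in both cases is the same: Twisted's HTTP client expects the response status line to have three space-separated parts (e.g. 'HTTP/1.1 500 Internal Server Error'), but this server sends only two ('HTTP/1.1 500' in your traceback, 'HTTP/1.1 302' in mine), so statusReceived raises ParseError("wrong number of parts", ...). If you want to see the offending status line for yourself outside of Scrapy and Twisted, a quick throwaway check over a raw socket works (just a sketch, assuming the server still responds the way it did here):

import socket

# Open a plain HTTP connection and issue the same GET by hand.
sock = socket.create_connection(('www.bewakoof.com', 80))
sock.sendall('GET /search/searchload/search_text/shirt/page_num/1 HTTP/1.1\r\n'
             'Host: www.bewakoof.com\r\nConnection: close\r\n\r\n')

# The first line of the reply is the status line Twisted chokes on,
# e.g. 'HTTP/1.1 500' with no reason phrase after the status code.
print sock.recv(4096).split('\r\n')[0]
sock.close()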

I think you can try this:

  • In the method _monkey_patching_HTTPClientParser_statusReceived, change from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError to from twisted.web._newclient import HTTPClientParser, ParseError;

  • In the method start_requests, call _monkey_patching_HTTPClientParser_statusReceived for every request in start_urls, for example (a full sketch combining both changes follows this list):
    def start_requests(self):
        for url in self.start_urls:
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)
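Putting both changes together, a minimal sketch of the whole patched spider might look like this (Python 2 / Scrapy 1.0, matching your logs; the URL is the one from your question, and I write statusReceived.__doc__ = old_sr.__doc__ with a single '=', since the '==' in your version looks like a typo):

from twisted.web._newclient import HTTPClientParser, ParseError

import scrapy
from scrapy.http import Request


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    start_urls = [
        "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1",
    ]

    def _monkey_patching_HTTPClientParser_statusReceived(self):
        # Wrap Twisted's status-line parser so that a two-part status line
        # such as 'HTTP/1.1 500' is re-parsed as 'HTTP/1.1 500 OK'.
        old_sr = HTTPClientParser.statusReceived

        def statusReceived(parser, status):
            try:
                return old_sr(parser, status)
            except ParseError, e:
                if e.args[0] == 'wrong number of parts':
                    # Append a dummy reason phrase to satisfy the parser.
                    return old_sr(parser, status + ' OK')
                raise

        statusReceived.__doc__ = old_sr.__doc__
        HTTPClientParser.statusReceived = statusReceived

    def start_requests(self):
        for url in self.start_urls:
            # Apply the patch before each request goes out.
            self._monkey_patching_HTTPClientParser_statusReceived()
            yield Request(url, dont_filter=True)

    def parse(self, response):
        # Your existing parse() logic goes here unchanged.
        print response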

Hope this helps.

Regarding "python - Twisted Python Failure - Scrapy Issues", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/31309631/
