
python - Scrapy: Error 10054 after retrying image download

Reposted. Author: 太空宇宙. Updated: 2023-11-03 16:50:49

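I'm running a Scrapy spider in Python to scrape images from a website. One of the images fails to download (even when I try to download it normally through the site), because of an internal error on the site. That's fine: I don't care about getting that image. I just want to skip it when it fails and move on to the other images, but I keep getting a 10054 error.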

> Traceback (most recent call last):
>   File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
>     current.result = callback(current.result, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 137, in parse_photo_page
>     self.retrievePhoto(base_url_photo + url[0], url_text)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 49, in wrapped_f
>     return Retrying(*dargs, **dkw).call(f, *args, **kw)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 212, in call
>     raise attempt.get()
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 247, in get
>     six.reraise(self.value[0], self.value[1], self.value[2])
>   File "C:\Python27\Scripts\nhtsa\nhtsa\retrying.py", line 200, in call
>     attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
>   File "C:\Python27\Scripts\nhtsa\nhtsa\spiders\NHTSA_spider.py", line 216, in retrievePhoto
>     code.write(f.read())
>   File "c:\python27\lib\socket.py", line 355, in read
>     data = self._sock.recv(rbufsize)
>   File "c:\python27\lib\httplib.py", line 612, in read
>     s = self.fp.read(amt)
>   File "c:\python27\lib\socket.py", line 384, in read
>     data = self._sock.recv(left)
> error: [Errno 10054] An existing connection was forcibly closed by the remote

Here is my parse function, which looks at the photo page and finds the important URLs:

def parse_photo_page(self, response):
    for sel in response.xpath('//table[@id="tblData"]/tr'):
        url = sel.xpath('td/font/a/@href').extract()
        table_fields = sel.xpath('td/font/text()').extract()
        if url:
            base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
            url_text = table_fields[3]
            url_text = string.replace(url_text, "&nbsp","")
            url_text = string.replace(url_text," ","")
            self.retrievePhoto(base_url_photo + url[0], url_text)

Here is my download function with the retry decorator:

from retrying import retry

@retry(stop_max_attempt_number=5, wait_fixed=2000)
def retrievePhoto(self, url, filename):
    fullPath = self.saveLocation + "/" + filename
    urllib.urlretrieve(url, fullPath)

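It retries the download 5 times, but then throws the 10054 error and does not continue to the next image. How can I get the spider to continue after retrying? Again, I don't care about downloading the problem image; I just want to skip it.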

Best answer

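It is correct that you shouldn't use urllib inside Scrapy, because it blocks everything. Try reading resources on "scrapy twisted" and "scrapy asynchronous". Anyway... I don't believe your main problem is "continuing after retrying", but rather not using relative XPaths in your expressions. Here is a version that works for me (note the ./ in './td/font/a/@href'):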

import scrapy
import string
import urllib
import os

from retrying import retry

class MyspiderSpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    def parse(self, response):
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                url_text = string.replace(url_text, "&nbsp","")
                url_text = string.replace(url_text," ","")
                self.retrievePhoto(base_url_photo + url[0], url_text)

    @retry(stop_max_attempt_number=5, wait_fixed=2000)
    def retrievePhoto(self, url, filename):
        fullPath = self.saveLocation + "/" + filename
        urllib.urlretrieve(url, fullPath)
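
If the blocking download is kept for now, the "skip it and move on" part can be handled by catching the error once the retries are exhausted, instead of letting it propagate out of the callback. A minimal sketch, assuming a hypothetical helper `download_with_retries` (not part of Scrapy or the `retrying` package) that takes a zero-argument `fetch` callable; the two stand-in fetchers are invented for the demo:

```python
import socket
import time


def download_with_retries(fetch, attempts=5, wait_seconds=2.0, sleep=time.sleep):
    """Run fetch() up to `attempts` times.

    Returns True as soon as one call succeeds, or False when every attempt
    raised a network/IO error, so the caller can skip the item.
    """
    for attempt in range(1, attempts + 1):
        try:
            fetch()
            return True
        except (IOError, socket.error):
            # Errno 10054 (connection reset) arrives as socket.error, which is
            # an IOError subclass on Python 2 and an OSError alias on Python 3.
            if attempt < attempts:
                sleep(wait_seconds)  # fixed wait between attempts
    return False


def always_reset():
    # Stand-in for a download that the server keeps dropping.
    raise socket.error(10054, "connection forcibly closed")


for name, fetch in [("bad.jpg", always_reset), ("good.jpg", lambda: None)]:
    if not download_with_retries(fetch, attempts=2, wait_seconds=0):
        print("skipping " + name)
        continue
    print("saved " + name)
```

In the spider, `fetch` could be something like `lambda: urllib.urlretrieve(url, fullPath)`. The download still blocks the reactor, so this only addresses the skipping, not the blocking.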

Here is a (better) version that follows your pattern but uses the ImagesPipeline that @paul trmbrth mentioned:

import scrapy
import os

class MyspiderSpider(scrapy.Spider):
    name = "myspider2"
    start_urls = (
        'file:index.html',
    )

    saveLocation = os.getcwd()

    # Let the built-in ImagesPipeline download everything listed in "image_urls"
    custom_settings = {
        "ITEM_PIPELINES": {'scrapy.pipelines.images.ImagesPipeline': 1},
        "IMAGES_STORE": saveLocation
    }

    def parse(self, response):
        image_urls = []
        image_texts = []
        for sel in response.xpath('//table[@id="tblData"]/tr'):
            url = sel.xpath('./td/font/a/@href').extract()
            table_fields = sel.xpath('./td/font/text()').extract()
            if url:
                base_url_photo = "http://www-nrd.nhtsa.dot.gov/"
                url_text = table_fields[3]
                # str.replace works on both Python 2 and 3, unlike string.replace
                url_text = url_text.replace("&nbsp", "").replace(" ", "")
                image_urls.append(base_url_photo + url[0])
                image_texts.append(url_text)

        return {"image_urls": image_urls, "image_texts": image_texts}
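
With ImagesPipeline, a failed download does not stop the crawl; the pipeline reports a result per URL instead. A rough sketch of picking out the images that did download, assuming the documented `item_completed(results, item, info)` result shape, a list of `(success, info_or_failure)` tuples. The helper name `split_results` and the sample data are made up for illustration, and nothing here imports Scrapy:

```python
def split_results(results):
    """Separate pipeline results into saved file paths and failures."""
    saved = [info["path"] for ok, info in results if ok]
    failed = [err for ok, err in results if not ok]
    return saved, failed


# Sample data shaped like the `results` argument; the values are invented.
# In a real run the failure entry would be a Twisted Failure object.
sample = [
    (True, {"url": "http://www-nrd.nhtsa.dot.gov/img/2015/cav.jpg",
            "path": "full/0a1b.jpg", "checksum": "abc"}),
    (False, Exception("connection reset")),
]

saved, failed = split_results(sample)
print(saved)
print(len(failed))
```

In a subclass of ImagesPipeline, something like this could be called from an `item_completed` override to keep only the saved paths on the item and ignore the failures.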

The demo file I used is this:

$ cat index.html 
<table id="tblData"><tr>

<td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg </font></td>

</tr><tr>

<td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg </font></td>

</tr></table>
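
To see why `table_fields[3]` ends up being the file name for this demo file, here is a small stdlib-only sketch that walks the same rows with relative paths. It uses `xml.etree.ElementTree` as a stand-in for Scrapy selectors (the demo markup happens to be well-formed XML), and `extract_photos` is a made-up name:

```python
import xml.etree.ElementTree as ET

DEMO_HTML = """<table id="tblData"><tr>
<td><font>hi <a href="img/2015/cav.jpg"> foo </a> <span /> <span /> green.jpg </font></td>
</tr><tr>
<td><font>hi <a href="img/2015/caw.jpg"> foo </a> <span /> <span /> blue.jpg </font></td>
</tr></table>"""

BASE_URL = "http://www-nrd.nhtsa.dot.gov/"


def extract_photos(markup):
    """Return (absolute_url, filename) pairs, one per table row with a link."""
    table = ET.fromstring(markup)
    photos = []
    for row in table.findall("./tr"):
        font = row.find("./td/font")  # path relative to the current row
        link = font.find("./a") if font is not None else None
        if link is None:
            continue
        # Direct text nodes of <font>: its leading text plus each child's tail.
        # For this demo file that matches what the 'text()' step selects.
        candidates = [font.text] + [child.tail for child in font]
        text_nodes = [t for t in candidates if t is not None]
        filename = text_nodes[3].replace("&nbsp", "").replace(" ", "")
        photos.append((BASE_URL + link.get("href"), filename))
    return photos


for url, filename in extract_photos(DEMO_HTML):
    print(url + " -> " + filename)
```

The fourth text node is the one following the second `<span />`, which is where the file name sits in each row.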

Regarding "python - Scrapy: Error 10054 after retrying image download", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35852744/
