
Python Scraper - socket error breaks the script if the target is a 404


I ran into an error while building a web scraper to compile data and output it in XLS format; when testing again against the list of domains I want to scrape from, the program trips up when it receives a socket error. I'm hoping to find an "if" statement that would make it skip over a broken website and continue my while loop. Any ideas?

import xlrd
import xlwt
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

# listSelection (the path of the .xls domain list) is set earlier in the script
workingList = xlrd.open_workbook(listSelection)
workingSheet = workingList.sheet_by_index(0)
destinationList = xlwt.Workbook()
destinationSheet = destinationList.add_sheet('Gathered')
startX = 1
startY = 0
while startX != 21:
    workingCell = workingSheet.cell(startX, startY).value
    print ''
    print ''
    print ''
    print workingCell
    # Setup
    preSite = 'http://www.' + workingCell
    theSite = urlopen(preSite).read()
    currentSite = BeautifulSoup(theSite)
    destinationSheet.write(startX, 0, workingCell)

Here is the error:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    homeMenu()
  File "C:\Python27\farming.py", line 31, in homeMenu
    openList()
  File "C:\Python27\farming.py", line 79, in openList
    openList()
  File "C:\Python27\farming.py", line 83, in openList
    openList()
  File "C:\Python27\farming.py", line 86, in openList
    homeMenu()
  File "C:\Python27\farming.py", line 34, in homeMenu
    startScrape()
  File "C:\Python27\farming.py", line 112, in startScrape
    theSite = urlopen(preSite).read()
  File "C:\Python27\lib\urllib.py", line 84, in urlopen
    return opener.open(url)
  File "C:\Python27\lib\urllib.py", line 205, in open
    return getattr(self, name)(url)
  File "C:\Python27\lib\urllib.py", line 342, in open_http
    h.endheaders(data)
  File "C:\Python27\lib\httplib.py", line 951, in endheaders
    self._send_output(message_body)
  File "C:\Python27\lib\httplib.py", line 811, in _send_output
    self.send(msg)
  File "C:\Python27\lib\httplib.py", line 773, in send
    self.connect()
  File "C:\Python27\lib\httplib.py", line 754, in connect
    self.timeout, self.source_address)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
IOError: [Errno socket error] [Errno 11004] getaddrinfo failed

Best Answer

Hmm, that looks like the error I get when my internet connection is down. An HTTP 404 error is what you get when you do have a connection but the URL that you specify can't be found.

There is no if statement for handling exceptions; you need to "catch" them using a try/except construct.

Update: here is a demonstration:

import urllib

def getconn(url):
    try:
        conn = urllib.urlopen(url)
        return conn, None
    except IOError as e:
        return None, e

urls = """
    qwerty
    http://www.foo.bar.net
    http://www.google.com
    http://www.google.com/nonesuch
    """
for url in urls.split():
    print
    print url
    conn, exc = getconn(url)
    if conn:
        print "connected; HTTP response is", conn.getcode()
    else:
        print "failed"
        print exc.__class__.__name__
        print str(exc)
        print exc.args

Output:

qwerty
failed
IOError
[Errno 2] The system cannot find the file specified: 'qwerty'
(2, 'The system cannot find the file specified')

http://www.foo.bar.net
failed
IOError
[Errno socket error] [Errno 11004] getaddrinfo failed
('socket error', gaierror(11004, 'getaddrinfo failed'))

http://www.google.com
connected; HTTP response is 200

http://www.google.com/nonesuch
connected; HTTP response is 404

Note that so far we have merely opened a connection. What you need to do now is check the HTTP response code and decide whether there is anything worth retrieving with conn.read().
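
To tie this back to the question's loop: below is a minimal sketch, assuming the same Python 2 / urllib / BeautifulSoup setup as the original script, and assuming that startX is incremented somewhere in the part of the loop that was not shown, of how the try/except plus a response-code check could skip a dead or missing domain and carry on:

from urllib import urlopen
from BeautifulSoup import BeautifulSoup

while startX != 21:
    workingCell = workingSheet.cell(startX, startY).value
    preSite = 'http://www.' + workingCell
    try:
        conn = urlopen(preSite)  # raises IOError on DNS/socket failures
    except IOError as e:
        print 'skipping %s: %s' % (workingCell, e)
        startX = startX + 1  # assumed increment; the original snippet was truncated
        continue
    if conn.getcode() != 200:  # e.g. 404: connected, but nothing worth parsing
        print 'skipping %s: HTTP %s' % (workingCell, conn.getcode())
        startX = startX + 1
        continue
    currentSite = BeautifulSoup(conn.read())
    destinationSheet.write(startX, 0, workingCell)
    startX = startX + 1

The two checks are deliberately separate: as the demonstration's output shows, urlopen only raises IOError for connection-level failures such as the getaddrinfo error above, while a 404 still returns a connection object, so it has to be filtered out via getcode().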

Regarding "Python Scraper - socket error breaks the script if the target is a 404", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8860200/
