
Python, Mechanize - Request disallowed by robots.txt even after set_handle_robots and add_headers


I have made a web crawler which gets all links up to the first level of a page, and from those it gets all links and text plus image links and alt text. Here is the whole code:

import urllib
import re
import time
from threading import Thread
import MySQLdb
import mechanize
import readability
from bs4 import BeautifulSoup
from readability.readability import Document
import urlparse

url = ["http://sparkbrowser.com"]

i = 0

while i < len(url):

    counterArray = [0]

    levelLinks = []
    linkText = ["homepage"]
    levelLinks = []

    def scraper(root, steps):
        urls = [root]
        visited = [root]
        counter = 0
        while counter < steps:
            step_url = scrapeStep(urls)
            urls = []
            for u in step_url:
                if u not in visited:
                    urls.append(u)
                    visited.append(u)
                    counterArray.append(counter + 1)
            counter += 1
        levelLinks.append(visited)
        return visited

    def scrapeStep(root):
        result_urls = []
        br = mechanize.Browser()
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

        for url in root:
            try:
                br.open(url)

                for link in br.links():
                    newurl = urlparse.urljoin(link.base_url, link.url)
                    result_urls.append(newurl)
                    #levelLinks.append(newurl)
            except:
                print "error"
        return result_urls


    scraperOut = scraper(url[i], 1)

    for sl, ca in zip(scraperOut, counterArray):
        print "\n\n", sl, " Level - ", ca, "\n"

        #Mechanize
        br = mechanize.Browser()
        page = br.open(sl)
        br.set_handle_robots(False)
        br.set_handle_equiv(False)
        br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
        #BeautifulSoup
        htmlcontent = page.read()
        soup = BeautifulSoup(htmlcontent)

        for linkins in br.links(text_regex=re.compile('^((?!IMG).)*$')):
            newesturl = urlparse.urljoin(linkins.base_url, linkins.url)
            linkTxt = linkins.text
            print newesturl, linkTxt

        for linkwimg in soup.find_all('a', attrs={'href': re.compile("^http://")}):
            imgSource = linkwimg.find('img')
            if linkwimg.find('img', alt=True):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = linkwimg.img['alt']
                #imageAlt.append(imgAlt)
                print imgLink, imgAlt
            elif linkwimg.find('img', alt=False):
                imgLink = linkwimg['href']
                #imageLinks.append(imgLink)
                imgAlt = ['No Alt']
                #imageAlt.append(imgAlt)
                print imgLink, imgAlt

    i += 1

Everything works fine until my crawler reaches one of the Facebook links, which it can't read, and it gives me the error:

httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

Line 68 is: page = br.open(sl)

Now I don't know why that happens, because as you can see, I've set Mechanize's set_handle_robots and add_headers options.

I don't know why it is, but I've noticed that I'm getting that error for Facebook links, in this case facebook.com/sparkbrowser, and for Google.
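To isolate it, the mechanize part for one of those links boils down to roughly this (I've only written the Facebook link above out as a full URL; everything else is the same code, in the same order, as inside the zip() loop):

import mechanize

br = mechanize.Browser()
# Same call order as in the loop above: the page is opened first,
# then the robots/equiv handlers and the headers are configured.
page = br.open("http://www.facebook.com/sparkbrowser")
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]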

Any help or advice is welcome.

Cheers

Best Answer

OK, so the same problem appears in this question:

Why is mechanize throwing a HTTP 403 error?

Sending all of the request headers a normal browser would send, and accepting / sending back the cookies the server sends, should do the trick.
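As a rough sketch of what that means (not copied from the linked answer; the extra header values and the test URL here are just examples), configure the mechanize browser with a cookie jar and a fuller set of browser-like headers before the first open() call:

import mechanize
import cookielib

br = mechanize.Browser()

# Accept the cookies the server sets and send them back on later requests.
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Configure the handlers before any open() call.
br.set_handle_robots(False)    # don't fetch or obey robots.txt
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)

# Send the kind of headers a real browser sends, not only User-agent.
br.addheaders = [
    ('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Accept-Language', 'en-US,en;q=0.5'),
    ('Connection', 'keep-alive'),
]

response = br.open("http://www.facebook.com/sparkbrowser")
html = response.read()

The same Browser object can then be reused for the rest of the crawl, so the cookies collected on the first request are sent along with the later ones.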

A similar question about "Python, Mechanize - Request disallowed by robots.txt even after set_handle_robots and add_headers" can be found on Stack Overflow: https://stackoverflow.com/questions/18096885/
