
python - Mechanize returns robots.txt even though it is set to ignore it

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:56:05

I have run into sites that return a ROBOTS meta tag when I try to fetch their pages, and they keep doing so even when I use Mechanize. For example:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.open("http://myanimelist.net/anime.php?letter=B")
response = br.response().read()

I have tried setting headers and other handlers, but I never get a response that is anything other than the ROBOTS meta tag.
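For what it's worth, a quick way to confirm that a response is a bot-wall rather than the real page is to check the body for a ROBOTS meta tag and a suspiciously small payload. The HTML sample and the size threshold below are made up for illustration:

```python
import re

def is_robots_page(html):
    """Heuristic: True if the body looks like a bot-wall interstitial,
    i.e. it carries a ROBOTS meta tag and almost no real content."""
    has_robots_meta = re.search(
        r'<meta[^>]+name=["\']?robots["\']?', html, re.IGNORECASE
    ) is not None
    # Real listing pages are far larger than an interstitial; 2000 bytes
    # is an arbitrary illustrative cutoff.
    return has_robots_meta and len(html) < 2000

# Made-up sample of what such an interstitial might look like
sample = ('<html><head>'
          '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
          '</head><body></body></html>')
print(is_robots_page(sample))  # True
```

This kind of check only diagnoses the problem; it does not get past the block.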

Any help is greatly appreciated, thanks.

Edit:

Tried the headers suggested below:

import mechanize

url = "http://myanimelist.net/anime.php?letter=B"

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'),
    ('Host', 'myanimelist.net'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, sdch'),
    ('Accept-Language', 'en-US,en;q=0.8,ru;q=0.6'),
    ('Cache-Control', 'max-age=0'),
    ('Connection', 'keep-alive'),
]
br.open(url)
response = br.response().read()
print(response)

I still get the same ROBOTS meta tag. Am I adding the wrong headers, or am I simply stuck behind a CAPTCHA?

Thanks for your help, I appreciate it.

Best Answer

As I understand it, set_handle_robots() only controls whether the rules listed in robots.txt are followed:

def set_handle_robots(self, handle):
"""Set whether to observe rules from robots.txt."""

That said, you should respect robots.txt and be a good web-scraping citizen.


However, they are very strict about web scraping; you can easily end up facing a CAPTCHA, so be careful. FYI, the site is protected by Incapsula, which provides advanced bot protection:

Using advanced client classification technology, crowdsourcing and reputation-based techniques, Incapsula distinguishes between "good" and "bad" bot traffic. This lets you block scrapers, vulnerability scanners and comment spammers that overload your servers and steal your content, while allowing search engines and other legitimate services to freely access your website.

Another important FYI, quoting from the site's "Terms of Use":

You agree not to use or launch any automated system, including without limitation, "robots," "spiders," "offline readers," etc. , that accesses the Service in a manner that sends more request messages to the Company servers than a human can reasonably produce in the same period of time by using a conventional on-line web browser, and you agree not to aggregate or collate any of the content available through the Service for use elsewhere. You also agree not to collect or harvest any personally identifiable information, including account or profile names, from the Service nor to use the communication systems provided by the Service for any commercial solicitation purposes.

Which brings me to my actual answer: there is an official API available. Use it, and stay on the right side of the law.
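As a rough sketch of what that looks like in practice: the endpoint URL, query parameters, and auth header name below are assumptions for illustration, not taken from this answer, so verify every one of them against the official API documentation before use. The idea is simply that a documented API request replaces the scraper:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and header name; check the official API docs.
API_URL = "https://api.myanimelist.net/v2/anime"
CLIENT_ID = "your-client-id-here"  # placeholder credential

params = urlencode({"q": "B", "limit": 100})  # assumed query parameters
req = Request(
    API_URL + "?" + params,
    headers={"X-MAL-CLIENT-ID": CLIENT_ID},  # assumed auth header
)

# Inspect the request without sending it
print(req.full_url)
# To send for real: urllib.request.urlopen(req)
```

Unlike the scraping route, an API client identifies itself with a registered credential, so there is no bot wall to fight.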

Regarding "python - Mechanize returns robots.txt even though it is set to ignore it", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27763085/
