
python - Mechanize returns robots.txt even though it is set to ignore it

Reposted. Author: 太空宇宙. Updated: 2023-11-04 05:56:05

I have run into sites that return a ROBOTS meta tag when I try to fetch their pages, and they keep doing so even when I use Mechanize. For example:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # do not fetch or obey robots.txt
br.open("http://myanimelist.net/anime.php?letter=B")
response = br.response().read()

I have tried setting headers and other handlers, but I never get a response that is anything other than the ROBOTS meta tag.
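For what it's worth, a quick way to confirm that a response is a bot-wall rather than the real page is to check the body for a ROBOTS meta tag and a suspiciously small payload. The HTML sample and the size threshold below are made up for illustration:

```python
import re

def is_robots_page(html):
    """Heuristic: True if the body looks like a bot-wall interstitial,
    i.e. it carries a ROBOTS meta tag and almost no real content."""
    has_robots_meta = re.search(
        r'<meta[^>]+name=["\']?robots["\']?', html, re.IGNORECASE
    ) is not None
    # Real listing pages are far larger than an interstitial; 2000 bytes
    # is an arbitrary illustrative cutoff.
    return has_robots_meta and len(html) < 2000

# Made-up sample of what such an interstitial might look like
sample = ('<html><head>'
          '<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
          '</head><body></body></html>')
print(is_robots_page(sample))  # True
```

This kind of check only diagnoses the problem; it does not get past the block.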

Any help is greatly appreciated, thanks.

Edit:

Tried the headers suggested below:

import mechanize

url = "http://myanimelist.net/anime.php?letter=B"

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [
    ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'),
    ('Host', 'myanimelist.net'),
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'),
    ('Accept-Encoding', 'gzip, deflate, sdch'),
    ('Accept-Language', 'en-US,en;q=0.8,ru;q=0.6'),
    ('Cache-Control', 'max-age=0'),
    ('Connection', 'keep-alive'),
]
br.open(url)
response = br.response().read()
print(response)

I still get the same ROBOTS meta tag. Am I adding the wrong headers, or am I simply stuck behind a CAPTCHA?

Thanks for your help, I appreciate it.

Best Answer

As I understand it, set_handle_robots() only controls whether the rules listed in robots.txt are followed:

def set_handle_robots(self, handle):
"""Set whether to observe rules from robots.txt."""

That said, you should respect robots.txt and be a good web-scraping citizen.


However, they are very strict about web scraping; you can easily end up facing a CAPTCHA, so be careful. FYI, the site is protected by Incapsula, which provides advanced bot protection:

Using advanced client classification technology, crowdsourcing and reputation-based techniques, Incapsula distinguishes between "good" and "bad" bot traffic. This lets you block scrapers, vulnerability scanners and comment spammers that overload your servers and steal your content, while allowing search engines and other legitimate services to freely access your website.

Another important FYI, quoting from the site's "Terms of Use":

You agree not to use or launch any automated system, including without limitation, "robots," "spiders," "offline readers," etc. , that accesses the Service in a manner that sends more request messages to the Company servers than a human can reasonably produce in the same period of time by using a conventional on-line web browser, and you agree not to aggregate or collate any of the content available through the Service for use elsewhere. You also agree not to collect or harvest any personally identifiable information, including account or profile names, from the Service nor to use the communication systems provided by the Service for any commercial solicitation purposes.

Which brings me to my actual answer: there is an official API available. Use it, and stay on the right side of the law.
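As a rough sketch of what that looks like in practice: the endpoint URL, query parameters, and auth header name below are assumptions for illustration, not taken from this answer, so verify every one of them against the official API documentation before use. The idea is simply that a documented API request replaces the scraper:

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and header name; check the official API docs.
API_URL = "https://api.myanimelist.net/v2/anime"
CLIENT_ID = "your-client-id-here"  # placeholder credential

params = urlencode({"q": "B", "limit": 100})  # assumed query parameters
req = Request(
    API_URL + "?" + params,
    headers={"X-MAL-CLIENT-ID": CLIENT_ID},  # assumed auth header
)

# Inspect the request without sending it
print(req.full_url)
# To send for real: urllib.request.urlopen(req)
```

Unlike the scraping route, an API client identifies itself with a registered credential, so there is no bot wall to fight.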

Regarding "python - Mechanize returns robots.txt even though it is set to ignore it", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/27763085/
