
web-crawler - Disabling the robots.txt check in Nutch


I want to disable the robots.txt check in Nutch and crawl everything from websites. By "disable" I mean skipping the robots.txt check before fetching or parsing any website.
Is this possible?

Best Answer

Although this question is old, I personally feel it is still worth answering.
Yes, it is possible to disable the robots.txt flow (but you will need to modify and rebuild the Nutch source code).
Note: Nutch does not provide any configuration option to skip fetching robots.txt before the actual URLs are fetched, because what you are asking for amounts to abusing a URL/domain: accessing a site regardless of what it declares about its resources in robots.txt.
How is it possible?
If you have a genuine custom use case that really requires skipping robots.txt, you can do the following.
Most protocol plugins in Nutch (protocol-(http|httpclient|selenium|okhttp)) use the HttpRobotRulesParser class to fetch and parse the robots.txt content.
In HttpRobotRulesParser, the rules object is parsed and returned by this particular method:

public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {

  if (LOG.isTraceEnabled() && isWhiteListed(url)) {
    LOG.trace("Ignoring robots.txt (host is whitelisted) for URL: {}", url);
  }

  String cacheKey = getCacheKey(url);
  BaseRobotRules robotRules = CACHE.get(cacheKey);

  if (robotRules != null) {
    return robotRules; // cached rule
  } else if (LOG.isTraceEnabled()) {
    LOG.trace("cache miss " + url);
  }

  boolean cacheRule = true;
  URL redir = null;

  if (isWhiteListed(url)) {
    // check in advance whether a host is whitelisted
    // (we do not need to fetch robots.txt)
    robotRules = EMPTY_RULES;
    LOG.info("Whitelisted host found for: {}", url);
    LOG.info("Ignoring robots.txt for all URLs from whitelisted host: {}",
        url.getHost());

  } else {
    try {
      URL robotsUrl = new URL(url, "/robots.txt");
      Response response = ((HttpBase) http).getResponse(robotsUrl,
          new CrawlDatum(), false);
      if (robotsTxtContent != null) {
        addRobotsContent(robotsTxtContent, robotsUrl, response);
      }
      // try one level of redirection ?
      if (response.getCode() == 301 || response.getCode() == 302) {
        String redirection = response.getHeader("Location");
        if (redirection == null) {
          // some versions of MS IIS are known to mangle this header
          redirection = response.getHeader("location");
        }
        if (redirection != null) {
          if (!redirection.startsWith("http")) {
            // RFC says it should be absolute, but apparently it isn't
            redir = new URL(url, redirection);
          } else {
            redir = new URL(redirection);
          }

          response = ((HttpBase) http).getResponse(redir, new CrawlDatum(), false);
          if (robotsTxtContent != null) {
            addRobotsContent(robotsTxtContent, redir, response);
          }
        }
      }

      if (response.getCode() == 200) // found rules: parse them
        robotRules = parseRules(url.toString(), response.getContent(),
            response.getHeader("Content-Type"), agentNames);

      else if ((response.getCode() == 403) && (!allowForbidden))
        robotRules = FORBID_ALL_RULES; // use forbid all
      else if (response.getCode() >= 500) {
        // cacheRule = false; // try again later to fetch robots.txt
        robotRules = EMPTY_RULES;
      } else
        robotRules = EMPTY_RULES; // use default rules
    } catch (Throwable t) {
      if (LOG.isInfoEnabled()) {
        LOG.info("Couldn't get robots.txt for " + url + ": " + t.toString());
      }
      // cacheRule = false; // try again later to fetch robots.txt
      robotRules = EMPTY_RULES;
    }
  }

  if (cacheRule) {
    CACHE.put(cacheKey, robotRules); // cache rules for host
    if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())
        && "/robots.txt".equals(redir.getFile())) {
      // cache also for the redirected host
      // if the URL path is /robots.txt
      CACHE.put(getCacheKey(redir), robotRules);
    }
  }

  return robotRules;
}
You can go ahead and replace it with the following method:
@Override
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {
  return EMPTY_RULES; // always return empty rules to skip robots.txt access.
}
You are simply mocking the behaviour by returning EMPTY_RULES.
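To see why this works: EMPTY_RULES (defined in Nutch's RobotRulesParser base class) is, as far as I can tell, an "allow all" rule set from the crawler-commons library, so every isAllowed() check performed by the fetcher passes and no URL is ever blocked by robots.txt. Below is a minimal standalone sketch of that behaviour, assuming crawler-commons is on the classpath (the class name AllowAllRulesDemo is just for illustration, not part of the Nutch patch):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class AllowAllRulesDemo {
  public static void main(String[] args) {
    // An ALLOW_ALL rule set (which is what EMPTY_RULES amounts to) permits everything.
    BaseRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    // Both checks print "true": nothing is disallowed, so the fetcher never skips a URL.
    System.out.println(rules.isAllowed("https://example.com/"));
    System.out.println(rules.isAllowed("https://example.com/some/private/path"));
  }
}

Because the fetch path consults the returned BaseRobotRules object only for the allow/deny decision and the crawl delay, short-circuiting this one method should be enough; no other code paths need to change.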
Important note: it is always recommended to read robots.txt and access only the resources it permits.

Regarding "web-crawler - Disabling the robots.txt check in Nutch", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14897058/
