
web-crawler - Disabling the robots.txt check in Nutch


I want to disable the robots.txt check in Nutch and crawl everything from websites. By "disable" I mean skipping the robots.txt check before fetching or parsing any website.
Is this possible?

Best Answer

Although this question is old, I personally feel it is still worth answering.
Yes, it is possible to disable the robots.txt flow (but you will need to modify and rebuild the Nutch source code).
Note: Nutch does not provide any configuration option to skip fetching robots.txt before the actual URLs are fetched, because what you are asking for amounts to abusing a URL/domain: accessing a site regardless of what it declares about its resources in robots.txt.
How is it possible?
If you have a genuine custom use case that really requires skipping robots.txt, you can do the following.
Most protocol plugins in Nutch (protocol-(http|httpclient|selenium|okhttp)) use the HttpRobotRulesParser class to fetch and parse the robots.txt content.
In HttpRobotRulesParser, the rules object is parsed and returned by this particular method:

public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {

  if (LOG.isTraceEnabled() && isWhiteListed(url)) {
    LOG.trace("Ignoring robots.txt (host is whitelisted) for URL: {}", url);
  }

  String cacheKey = getCacheKey(url);
  BaseRobotRules robotRules = CACHE.get(cacheKey);

  if (robotRules != null) {
    return robotRules; // cached rule
  } else if (LOG.isTraceEnabled()) {
    LOG.trace("cache miss " + url);
  }

  boolean cacheRule = true;
  URL redir = null;

  if (isWhiteListed(url)) {
    // check in advance whether a host is whitelisted
    // (we do not need to fetch robots.txt)
    robotRules = EMPTY_RULES;
    LOG.info("Whitelisted host found for: {}", url);
    LOG.info("Ignoring robots.txt for all URLs from whitelisted host: {}",
        url.getHost());

  } else {
    try {
      URL robotsUrl = new URL(url, "/robots.txt");
      Response response = ((HttpBase) http).getResponse(robotsUrl,
          new CrawlDatum(), false);
      if (robotsTxtContent != null) {
        addRobotsContent(robotsTxtContent, robotsUrl, response);
      }
      // try one level of redirection ?
      if (response.getCode() == 301 || response.getCode() == 302) {
        String redirection = response.getHeader("Location");
        if (redirection == null) {
          // some versions of MS IIS are known to mangle this header
          redirection = response.getHeader("location");
        }
        if (redirection != null) {
          if (!redirection.startsWith("http")) {
            // RFC says it should be absolute, but apparently it isn't
            redir = new URL(url, redirection);
          } else {
            redir = new URL(redirection);
          }

          response = ((HttpBase) http).getResponse(redir, new CrawlDatum(), false);
          if (robotsTxtContent != null) {
            addRobotsContent(robotsTxtContent, redir, response);
          }
        }
      }

      if (response.getCode() == 200) // found rules: parse them
        robotRules = parseRules(url.toString(), response.getContent(),
            response.getHeader("Content-Type"), agentNames);

      else if ((response.getCode() == 403) && (!allowForbidden))
        robotRules = FORBID_ALL_RULES; // use forbid all
      else if (response.getCode() >= 500) {
        // cacheRule = false; // try again later to fetch robots.txt
        robotRules = EMPTY_RULES;
      } else
        robotRules = EMPTY_RULES; // use default rules
    } catch (Throwable t) {
      if (LOG.isInfoEnabled()) {
        LOG.info("Couldn't get robots.txt for " + url + ": " + t.toString());
      }
      // cacheRule = false; // try again later to fetch robots.txt
      robotRules = EMPTY_RULES;
    }
  }

  if (cacheRule) {
    CACHE.put(cacheKey, robotRules); // cache rules for host
    if (redir != null && !redir.getHost().equalsIgnoreCase(url.getHost())
        && "/robots.txt".equals(redir.getFile())) {
      // cache also for the redirected host
      // if the URL path is /robots.txt
      CACHE.put(getCacheKey(redir), robotRules);
    }
  }

  return robotRules;
}
You can go ahead and replace it with the following method:
@Override
public BaseRobotRules getRobotRulesSet(Protocol http, URL url,
    List<Content> robotsTxtContent) {
  return EMPTY_RULES; // always return empty rules to skip robots.txt access.
}
You are simply mocking the behaviour by returning EMPTY_RULES.
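To see why this works: EMPTY_RULES (defined in Nutch's RobotRulesParser base class) is, as far as I can tell, an "allow all" rule set from the crawler-commons library, so every isAllowed() check performed by the fetcher passes and no URL is ever blocked by robots.txt. Below is a minimal standalone sketch of that behaviour, assuming crawler-commons is on the classpath (the class name AllowAllRulesDemo is just for illustration, not part of the Nutch patch):

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRules.RobotRulesMode;

public class AllowAllRulesDemo {
  public static void main(String[] args) {
    // An ALLOW_ALL rule set (which is what EMPTY_RULES amounts to) permits everything.
    BaseRobotRules rules = new SimpleRobotRules(RobotRulesMode.ALLOW_ALL);
    // Both checks print "true": nothing is disallowed, so the fetcher never skips a URL.
    System.out.println(rules.isAllowed("https://example.com/"));
    System.out.println(rules.isAllowed("https://example.com/some/private/path"));
  }
}

Because the fetch path consults the returned BaseRobotRules object only for the allow/deny decision and the crawl delay, short-circuiting this one method should be enough; no other code paths need to change.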
Important note: it is always recommended to read robots.txt and access only the resources it permits.

Regarding "web-crawler - Disabling the robots.txt check in Nutch", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/14897058/
