
java - Crawler4j, some URLs are crawled without issue while others are not crawled at all


I've been playing around with Crawler4j and have successfully had it crawl some pages, but it fails to crawl others. For example, I've gotten it to successfully crawl Reddit with this code:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "/home/user/Documents/Misc/Crawler/test";
        int numberOfCrawlers = 1;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched; the crawler then starts following links
         * found in these pages.
         */
        controller.addSeed("https://www.reddit.com/r/movies");
        controller.addSeed("https://www.reddit.com/r/politics");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}

And the following in MyCrawler.java:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith("https://www.reddit.com/");
}

However, when I try to crawl http://www.ratemyprofessors.com/, the program simply hangs with no output and never crawls anything. I use the following code in myController.java, just as above:

controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");

And in MyCrawler.java:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
            && href.startsWith("http://www.ratemyprofessors.com/");
}

So I'd like to know:

  • Are some servers able to recognize crawlers right away and refuse to let them collect data?
  • I noticed that the RateMyProfessors pages are .jsp; could that have anything to do with this?
  • Is there any way to debug this better? The console doesn't output anything.

Best Answer

crawler4j respects crawler politeness, such as robots.txt. In your case, this is the file at http://www.ratemyprofessors.com/robots.txt.

Inspecting this file reveals that your given seed points are disallowed for crawling:

Disallow: /ShowRatings.jsp
Disallow: /campusRatings.jsp
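You can confirm this programmatically with crawler4j's own RobotstxtServer; a small sketch, assuming RobotstxtServer.allows(WebURL), which is the same check CrawlController applies to seeds:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class RobotsCheck {
    public static void main(String[] args) throws Exception {
        PageFetcher pageFetcher = new PageFetcher(new CrawlConfig());
        RobotstxtServer robotstxtServer =
                new RobotstxtServer(new RobotstxtConfig(), pageFetcher);

        WebURL seed = new WebURL();
        seed.setURL("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");

        // Prints "allowed: false", because robots.txt disallows /campusRatings.jsp
        System.out.println("allowed: " + robotstxtServer.allows(seed));
    }
}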

This theory is supported by crawler4j's log output:

2015-12-15 19:47:18,791 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
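Incidentally, this also bears on the debugging question: crawler4j logs through SLF4J, so if the console prints nothing at all, make sure an SLF4J binding (log4j, logback, or slf4j-simple) is on the classpath. And if you deliberately want to ignore robots.txt (the site is explicitly asking not to be crawled, so consider the implications), crawler4j's robots.txt handling can be switched off. A minimal sketch, assuming the RobotstxtConfig.setEnabled switch available in crawler4j 4.x:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class PermissiveController {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawler4j-test"); // hypothetical storage path

        PageFetcher pageFetcher = new PageFetcher(config);

        // Disable robots.txt handling so disallowed seeds are no longer
        // rejected. The site has asked not to be crawled; use responsibly.
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        robotstxtConfig.setEnabled(false);

        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
        controller.start(MyCrawler.class, 1);
    }
}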

Regarding java - Crawler4j, some URLs are crawled without issue while others are not crawled at all, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34257022/
