
java - Increasing the number of threads for a crawler

Reposted. Author: 行者123. Updated: 2023-12-01 19:15:58

The following code is taken from http://code.google.com/p/crawler4j/; the file is named MyCrawler.java.


import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs ending in one of these extensions (stylesheets, scripts,
    // images, audio, video, archives) are never crawled.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        // Stay within the target site.
        if (href.startsWith("http://www.xyz.us.edu/")) {
            return true;
        }
        return false;
    }

    /*
     * This function is called when a page is fetched
     * and ready to be processed by your program
     */
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
    }
}

This is the Controller.java code that calls MyCrawler:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.xyz.us.edu/");
        controller.start(MyCrawler.class, 10);
    }
}

So I just want to make sure what this line in the Controller.java file means:

controller.start(MyCrawler.class, 10);

What does the 10 mean here? If we increase this 10 to 20, what effect will that have? Any suggestions would be appreciated.

Best answer

This site shows the source code of CrawlController.

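Increasing the value from 10 to 20 increases the number of crawlers (each crawler runs in its own thread). Studying that code will tell you what effect this has.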
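To make "one crawler per thread" concrete, here is a minimal sketch using only the Java standard library. It is not crawler4j's actual implementation: the class `CrawlerThreadsSketch`, the methods `crawl` and `fakeVisit`, and the 10 ms delay are all hypothetical names and values chosen for illustration. It only shows the general pattern of N threads pulling URLs from a shared queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class CrawlerThreadsSketch {

    // Hypothetical stand-in for fetching and parsing one page.
    static void fakeVisit(String url) {
        try {
            Thread.sleep(10); // assumed cost of one page, for illustration only
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Starts numberOfCrawlers threads that share one queue of URLs and
    // returns how many URLs were processed in total.
    static int crawl(List<String> urls, int numberOfCrawlers) {
        Queue<String> frontier = new ConcurrentLinkedQueue<>(urls);
        AtomicInteger processed = new AtomicInteger();
        List<Thread> crawlers = new ArrayList<>();

        for (int i = 0; i < numberOfCrawlers; i++) {
            Thread crawler = new Thread(() -> {
                String url;
                // Each crawler keeps taking work until the queue is empty.
                while ((url = frontier.poll()) != null) {
                    fakeVisit(url);
                    processed.incrementAndGet();
                }
            }, "Crawler " + (i + 1));
            crawlers.add(crawler);
            crawler.start();
        }

        // Wait for every crawler thread to finish.
        for (Thread crawler : crawlers) {
            try {
                crawler.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return processed.get();
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 40; i++) {
            urls.add("http://www.xyz.us.edu/page" + i);
        }
        for (int n : new int[] {10, 20}) {
            long start = System.nanoTime();
            int done = crawl(urls, n);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(n + " crawlers processed " + done
                    + " pages in about " + ms + " ms");
        }
    }
}
```

In this sketch, going from 10 to 20 threads changes only how many pages are in flight at once; the set of pages processed stays the same. In a real crawl the gain is usually bounded by network bandwidth, CPU, and how much load the target server should receive, so doubling the thread count need not double throughput.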

Regarding "java - Increasing the number of threads for a crawler", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/6683764/
