
java - Transferring an object between classes with crawler4j

Reposted. Author: 行者123. Updated: 2023-11-30 03:01:17

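I have a simple web crawler built from crawler4j's building blocks. I am trying to build a dictionary while the crawler crawls, and then pass it to my main class (Controller) once the text has been collected and parsed. Since my MyCrawler object is not created in my main class (MyCrawler.class is passed as the first argument instead), how can I do this? Also, I cannot change the controller.start method. I want to be able to use the dictionary created in the crawler after the crawl has finished.

The best approach I can think of would be for controller.start to accept a MyCrawler object that I define and create beforehand, but as far as I know there is no way to do that.

My code is below. Any help is much appreciated!

Crawler: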

import java.util.ArrayList;
import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler
{
    // Skip static resources and archives.
    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg|png|mp3|zip|gz))$");
    public ArrayList<String> dictionary = new ArrayList<String>();

    @Override public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        // Only follow links that stay under the seed site.
        return !FILTERS.matcher(href).matches()
            && href.startsWith("http://lyle.smu.edu/~fmoore");
    }

    @Override public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);
        if (page.getParseData() instanceof HtmlParseData)
        {
            HtmlParseData h = (HtmlParseData) page.getParseData();
            String text = h.getText();

            String[] words = text.split(" ");
            for (int i = 0; i < words.length; i++)
            {
                // Skip empty tokens and bare newlines (split never yields null).
                if (!words[i].isEmpty() && !words[i].equals("\n"))
                    dictionary.add(words[i]);
            }

            String html = h.getHtml();
            Set<WebURL> links = h.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
            System.out.println(text);
        }
    }
}

Controller:

import java.util.ArrayList;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public ArrayList<String> dictionary = new ArrayList<String>();

    public static void main(String[] args) throws Exception
    {
        int numberOfCrawlers = 1;
        String crawlStorageFolder = "/data/crawl/root";

        CrawlConfig c = new CrawlConfig();
        c.setCrawlStorageFolder(crawlStorageFolder);
        c.setMaxDepthOfCrawling(-1); // Unlimited depth
        c.setMaxPagesToFetch(-1);    // Unlimited pages
        c.setPolitenessDelay(200);   // Politeness delay in milliseconds

        PageFetcher pf = new PageFetcher(c);
        RobotstxtConfig robots = new RobotstxtConfig();
        RobotstxtServer rs = new RobotstxtServer(robots, pf);
        CrawlController controller = new CrawlController(c, pf, rs);

        controller.addSeed("http://lyle.smu.edu/~fmoore");

        controller.start(MyCrawler.class, numberOfCrawlers);

        controller.shutdown();
        controller.waitUntilFinish();
    }
}

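Best answer

Use a WebCrawlerFactory to create your MyCrawler objects. This should do the trick (at least as of version 4.2). Note, however, that your dictionary must support concurrent access (a plain ArrayList does not!).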

// Use a factory instead of supplying the crawler type, so the dictionary can be passed in
controller.start(new WebCrawlerFactory<MyCrawler>() {
    @Override
    public MyCrawler newInstance() throws Exception {
        return new MyCrawler(dictionary);
    }
}, numberOfCrawlers);
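
To illustrate the pattern without depending on crawler4j, here is a minimal sketch that uses only the JDK. The `Worker` and `WorkerFactory` interfaces and the `DictionaryWorker` and `SharedDictionaryDemo` classes are hypothetical stand-ins (they are not crawler4j types); they only mimic the shape of `WebCrawler`, `WebCrawlerFactory`, and `controller.start`. The shared list is wrapped with `Collections.synchronizedList`, which is one possible way to make it safe for concurrent `add` calls.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins, NOT crawler4j types: they only mimic the shape of
// WebCrawler and WebCrawlerFactory so the sketch runs without the library.
interface Worker {
    void visit(String pageText);
}

interface WorkerFactory<T extends Worker> {
    T newInstance() throws Exception;
}

// Plays the role of MyCrawler: the shared dictionary arrives via the constructor.
final class DictionaryWorker implements Worker {
    private final List<String> dictionary;

    DictionaryWorker(List<String> dictionary) {
        this.dictionary = dictionary;
    }

    @Override
    public void visit(String pageText) {
        // Split on runs of whitespace and skip empty tokens.
        for (String word : pageText.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                dictionary.add(word);
            }
        }
    }
}

final class SharedDictionaryDemo {
    // Plays the role of controller.start(factory, numberOfCrawlers):
    // the factory creates one worker per thread, then we wait for all of them.
    static <T extends Worker> void start(WorkerFactory<T> factory,
                                         int numberOfWorkers,
                                         String pageText) throws Exception {
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < numberOfWorkers; i++) {
            T worker = factory.newInstance();
            Thread thread = new Thread(() -> worker.visit(pageText));
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
    }

    // Creates the shared, thread-safe dictionary, runs the workers, and returns it.
    static List<String> crawl(int numberOfWorkers, String pageText) {
        final List<String> dictionary = Collections.synchronizedList(new ArrayList<>());
        try {
            start(() -> new DictionaryWorker(dictionary), numberOfWorkers, pageText);
        } catch (Exception e) {
            throw new IllegalStateException("worker run failed", e);
        }
        return dictionary;
    }

    public static void main(String[] args) {
        List<String> words = crawl(4, "the quick  brown fox\njumps");
        System.out.println(words.size() + " words collected");
    }
}
```

Applied to the question's code, this would presumably mean giving MyCrawler a constructor that accepts the list, and making sure `dictionary` is reachable from the static `main` method (for example as a final local variable or a static field), since an instance field of Controller cannot be referenced there.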

Regarding java - Transferring an object between classes with crawler4j, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35872956/
