java - WebCrawler with recursion

I am working on a web crawler that should download all images, files, and web pages it finds, and then recursively do the same for every web page it discovers. I seem to have a logic error, though.

    import java.io.File;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.net.URL;
    import java.net.URLConnection;
    import java.util.HashMap;

    public class WebCrawler {

        private static String url;
        private static int maxCrawlDepth;
        private static String filePath;

        /* Recursive function that crawls all web pages found on a given web page.
         * This function also saves elements from the DownloadRepository to disk.
         */
        public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {

            webpage.crawl(currentCrawlDepth);

            HashMap<String, WebPage> pages = webpage.getCrawledWebPages();

            if (currentCrawlDepth < maxCrawlDepth) {
                for (WebPage wp : pages.values()) {
                    crawling(wp, currentCrawlDepth + 1, maxCrawlDepth);
                }
            }
        }

        public static void main(String[] args) {

            if (args.length != 3) {
                System.out.println("Must pass three parameters");
                System.exit(0);
            }

            url = "";
            maxCrawlDepth = 0;
            filePath = "";

            url = args[0];
            try {
                URL testUrl = new URL(url);
                URLConnection urlConnection = testUrl.openConnection();
                urlConnection.connect();
            } catch (MalformedURLException e) {
                System.out.println("Not a valid URL");
                System.exit(0);
            } catch (IOException e) {
                System.out.println("Could not open URL");
                System.exit(0);
            }

            try {
                maxCrawlDepth = Integer.parseInt(args[1]);
            } catch (NumberFormatException e) {
                System.out.println("Argument is not an int");
                System.exit(0);
            }

            filePath = args[2];
            File path = new File(filePath);
            if (!path.exists()) {
                System.out.println("File Path is invalid");
                System.exit(0);
            }

            WebPage webpage = new WebPage(url);
            crawling(webpage, 0, maxCrawlDepth);

            System.out.println("Web crawl is complete");
        }
    }

The crawl method parses the contents of a site and stores any images, files, or links it finds into HashMaps, for example:

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.HashMap;

    import org.jsoup.Jsoup;
    import org.jsoup.Connection.Response;
    import org.jsoup.helper.HttpConnection;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class WebPage implements WebElement {

        private static Elements images;
        private static Elements links;

        private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
        private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
        private HashMap<String, WebFile> files = new HashMap<String, WebFile>();

        private String url;

        public WebPage(String url) {
            this.url = url;
        }

        /* The crawl method parses the html on a given web page
         * and adds the elements of the web page to the Download
         * Repository.
         */
        public void crawl(int currentCrawlDepth) {

            System.out.print("Crawling " + url + " at crawl depth ");
            System.out.println(currentCrawlDepth + "\n");

            Document doc = null;

            try {
                HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
                httpConnection.ignoreContentType(true);
                doc = httpConnection.get();
            } catch (MalformedURLException e) {
                System.out.println(e.getLocalizedMessage());
            } catch (IOException e) {
                System.out.println(e.getLocalizedMessage());
            } catch (IllegalArgumentException e) {
                System.out.println(url + " is not a valid URL");
            }

            DownloadRepository downloadRepository = DownloadRepository.getInstance();

            if (doc != null) {
                images = doc.select("img");
                links = doc.select("a[href]");

                for (Element image : images) {
                    String imageUrl = image.absUrl("src");
                    if (!webImages.containsValue(image)) {
                        WebImage webImage = new WebImage(imageUrl);
                        webImages.put(imageUrl, webImage);
                        downloadRepository.addElement(imageUrl, webImage);
                        System.out.println("Added image at " + imageUrl);
                    }
                }

                HttpConnection mimeConnection = null;
                Response mimeResponse = null;

                for (Element link : links) {
                    String linkUrl = link.absUrl("href");
                    linkUrl = linkUrl.trim();
                    if (!linkUrl.contains("#")) {
                        try {
                            mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                            mimeConnection.ignoreContentType(true);
                            mimeConnection.ignoreHttpErrors(true);
                            mimeResponse = (Response) mimeConnection.execute();
                        } catch (Exception e) {
                            System.out.println(e.getLocalizedMessage());
                        }

                        String contentType = null;
                        if (mimeResponse != null) {
                            contentType = mimeResponse.contentType();
                        }

                        if (contentType == null) {
                            continue;
                        }
                        if (contentType.equals("text/html")) {
                            if (!webPages.containsKey(linkUrl)) {
                                WebPage webPage = new WebPage(linkUrl);
                                webPages.put(linkUrl, webPage);
                                downloadRepository.addElement(linkUrl, webPage);
                                System.out.println("Added webPage at " + linkUrl);
                            }
                        } else {
                            if (!files.containsValue(link)) {
                                WebFile webFile = new WebFile(linkUrl);
                                files.put(linkUrl, webFile);
                                downloadRepository.addElement(linkUrl, webFile);
                                System.out.println("Added file at " + linkUrl);
                            }
                        }
                    }
                }
            }

            System.out.print("\nFinished crawling " + url + " at crawl depth ");
            System.out.println(currentCrawlDepth + "\n");
        }

        public HashMap<String, WebImage> getImages() {
            return webImages;
        }

        public HashMap<String, WebPage> getCrawledWebPages() {
            return webPages;
        }

        public HashMap<String, WebFile> getFiles() {
            return files;
        }

        public String getUrl() {
            return url;
        }

        @Override
        public void saveToDisk(String filePath) {
            System.out.println(filePath);
        }
    }

The purpose of the HashMaps is to ensure that I never parse the same site more than once. The bug seems to be related to my recursion. What is the problem?

Here is some sample output from a crawl starting at http://www.google.com:

Crawling https://www.google.com/ at crawl depth 0

Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0

Crawling https://www.google.com/services/ at crawl depth 1

Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1

**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**

Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/

Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2

**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**

Notice that it parses http://www.google.com/intl/en/policies/ twice.

Best Answer

You are creating a new map for every web page. That guarantees that a link appearing twice on the same page is only crawled once, but it does not handle the case where the same link appears on two different pages.

https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.

To avoid this, use a single map for the entire crawl and pass it as a parameter through the recursion.

    public class WebCrawler {

        // One map shared by the whole crawl, instead of one per page.
        private static HashMap<String, WebPage> visited = new HashMap<String, WebPage>();

        public static void crawling(Map<String, WebPage> visited, WebPage webpage,
                                    int currentCrawlDepth, int maxCrawlDepth) {
            // ...
        }
    }
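
For concreteness, here is a minimal sketch of how the recursion could thread that shared map through. It assumes the WebPage API from the question (getUrl, crawl, getCrawledWebPages); main would then start the crawl with something like crawling(new HashMap<String, WebPage>(), webpage, 0, maxCrawlDepth).

    // Sketch only: one map is shared across the whole recursion, so a URL that
    // appears on several different pages is still crawled at most once.
    public static void crawling(Map<String, WebPage> visited, WebPage webpage,
                                int currentCrawlDepth, int maxCrawlDepth) {
        // Skip pages that an earlier branch of the recursion already handled.
        if (visited.containsKey(webpage.getUrl())) {
            return;
        }
        visited.put(webpage.getUrl(), webpage);

        webpage.crawl(currentCrawlDepth);

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : webpage.getCrawledWebPages().values()) {
                crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }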

Since you also keep maps for images and so on, you might prefer to create a new object, perhaps called Visited, and have it keep track of everything that has been seen.

    public class Visited {

        private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();

        public boolean visit(String url, WebPage page) {
            if (webPages.containsKey(url)) {
                return false;
            }
            webPages.put(url, page);
            return true;
        }

        private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

        public boolean visit(String url, WebImage image) {
            if (webImages.containsKey(url)) {
                return false;
            }
            webImages.put(url, image);
            return true;
        }
    }
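
As a rough illustration (the shared visited instance here is an assumption, not part of the original code), the recursion guard then collapses to a single visit call, and the same object can also replace the per-page webImages check inside WebPage.crawl:

    // Illustrative sketch: recurse only when visit() reports the URL as new.
    if (visited.visit(webpage.getUrl(), webpage)) {
        webpage.crawl(currentCrawlDepth);
        // ... recurse into webpage.getCrawledWebPages() as before ...
    }

    // Inside WebPage.crawl, the same idea replaces the per-page webImages check:
    WebImage webImage = new WebImage(imageUrl);
    if (visited.visit(imageUrl, webImage)) {
        downloadRepository.addElement(imageUrl, webImage);
        System.out.println("Added image at " + imageUrl);
    }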

Regarding java - WebCrawler with recursion, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22141880/
