
perl - How can I make a Perl web crawler retrieve 'breadth first', like wget?

Reposted. Author: 行者123. Updated: 2023-12-02 19:30:26

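I have written a basic web crawler in Perl. How can I make it more sophisticated so that it retrieves in a 'breadth first' manner, like wget does?

This is from the wget docs: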

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

Any other comments about my code would be appreciated too.

use feature 'say';
use WWW::Mechanize;
use List::MoreUtils 'any';

##############
# parameters #
##############
my $url = "https://www.crawler-test.com/"; # starting url
my $depth_level = 2; # depth level for crawling (level 1 will only look at links on the starting page)
my $filter = ".pdf"; # for multiple types use this format: ".pdf|.docx|.doc|.rtf"
my $wait = 2; # wait this number of seconds between http requests (be kind to the server)
my $domain = ""; # only crawl links with host ending in this string, leave blank if not required. For multiple domains, use this format: "domain1|domain2"
my $nocrawlagain = 1; # setting this to 1 will mean that the same link will not be crawled again, set to 0 to turn this off
##############


$domain = quotemeta($domain);
$domain =~ s/\\\|/|/g;

my @alreadycrawled; # URLs that have already been crawled (checked in crawl below)

open LOG, ">", "mecherrors.log" or die "Cannot open mecherrors.log: $!";
LOG->autoflush;

my $mech = WWW::Mechanize->new(stack_depth => 0, onerror => \&mecherror);

sub crawl {

    my $url = shift;
    my $filter = shift;
    my $depth = shift || 1;

    return if $depth > $depth_level;

    say "Crawling $url";
    $mech->get($url);
    sleep $wait;
    return unless ($mech->success and $mech->is_html);

    my @linkstocrawl;

    for my $link ($mech->find_all_links(url_abs_regex => qr/^http/)) # only get http links (excludes things like mailto:)
    {
        next if $link->url =~ /#/; # excludes URLs that are referring to an anchor

        # if the link matches the filter then download it
        if ($link->url =~ /($filter)$/)
        {
            my $urlfilename = ($link->URI->path_segments)[-1];
            next if -e $urlfilename;
            $mech->get($url); # go to base page
            sleep $wait;
            $mech->get($link->url);
            sleep $wait;
            my $filename = $mech->response->filename;
            next if -e $filename;
            $mech->save_content($filename);
            say "Saved $filename";
        } else {
            push @linkstocrawl, $link;
        }
    }

    for my $link (@linkstocrawl)
    {
        next unless $link->url_abs->host =~ /($domain)$/;
        if ($nocrawlagain)
        {
            # skip if already crawled this link
            next if any { $_ eq $link->url_abs->abs } @alreadycrawled;
            push @alreadycrawled, $link->url_abs->abs;
        }
        crawl($link->url_abs->abs, $filter, $depth + 1);
    }

}


crawl($url, $filter);

sub mecherror {
    print LOG "[", $mech->uri, "] ", $mech->response->message, "\n";
}

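Best answer

If you want to do breadth-first, you need to take the my @linkstocrawl declaration out of sub crawl, so that there is only one master to-do list rather than a separate list for each call to the crawl sub.

It will also be easier to do breadth-first if you make the code non-recursive, because recursion lends itself more or less automatically to depth-first. (When you recursively call a sub to handle a section of the search space, that sub does not return until that section is completely finished, which is not what you want for breadth-first.)

So the general structure you want is something like this (not complete or tested code):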
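To make the difference in visiting order concrete, here is a minimal sketch that walks the same small link graph twice: once recursively and once with a single shared queue. The graph %links_on and the sub names visit_recursive and visit_queue are made up for illustration; no pages are fetched.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real pages: page => links found on it.
my %links_on = (
    A => [qw(B C)],
    B => [qw(D)],
    C => [qw(E)],
    D => [],
    E => [],
);

# Recursive traversal: each call finishes its whole subtree before returning,
# so the walk goes deep before it goes wide.
sub visit_recursive {
    my ($page, $seen, $order) = @_;
    return if $seen->{$page}++;
    push @$order, $page;
    visit_recursive($_, $seen, $order) for @{ $links_on{$page} };
    return $order;
}

# Queue-based traversal: one shared to-do list, processed first in, first out,
# so every page at one depth is handled before any page at the next depth.
sub visit_queue {
    my ($start) = @_;
    my @todo = ($start);
    my (%seen, @order);
    while (defined(my $page = shift @todo)) {
        next if $seen{$page}++;
        push @order, $page;
        push @todo, @{ $links_on{$page} };
    }
    return \@order;
}

my $recursive = visit_recursive('A', {}, []);
my $queued    = visit_queue('A');
print "recursive: @$recursive\n";   # A B D C E
print "queue:     @$queued\n";      # A B C D E
```

The recursive walk reaches D (two links away from A) before C (one link away), while the queue-based walk visits B and C first and only then D and E.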

my @linkstocrawl = ($starting_url);
my %linkscrawled; # hash instead of array for faster/easier lookups

while (my $url = shift @linkstocrawl) {
    next if exists $linkscrawled{$url}; # already saw it, so skip it
    $linkscrawled{$url}++;

    my $page = fetch($url);
    push @linkstocrawl, find_links_on($page);
    # you could also push the links onto @linkstocrawl one-by-one, depending on
    # whether you prefer to parse the page incrementally or grab them all at once

    # Do whatever else you want to do with $page
}
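The loop above has no depth limit, which the original crawler and wget both use. One way to add it, sketched here under some assumptions, is to store each URL's depth next to it in the queue. The %site map, the example.test URLs, and the subs find_links_on and crawl_breadth_first are hypothetical stand-ins so the sketch runs without network access; in a real crawler find_links_on would fetch the page, for example with WWW::Mechanize. The starting page is counted as depth 0 here, which is a choice made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical site map used in place of real HTTP requests: URL => links on that page.
my %site = (
    'http://example.test/'  => ['http://example.test/a', 'http://example.test/b'],
    'http://example.test/a' => ['http://example.test/c', 'http://example.test/'],
    'http://example.test/b' => ['http://example.test/c'],
    'http://example.test/c' => ['http://example.test/d'],
    'http://example.test/d' => [],
);

# Stand-in for fetching a page and extracting its links (hypothetical helper).
sub find_links_on {
    my ($url) = @_;
    return @{ $site{$url} || [] };
}

# Breadth-first crawl that does not follow links from pages at $max_depth.
# Returns a list of [url, depth] pairs in visiting order.
sub crawl_breadth_first {
    my ($start, $max_depth) = @_;
    my @todo = ([$start, 0]);   # queue of [url, depth] pairs
    my %seen = ($start => 1);   # mark URLs when queued, so none is queued twice
    my @visited;

    while (my $item = shift @todo) {
        my ($url, $depth) = @$item;
        push @visited, [$url, $depth];
        next if $depth >= $max_depth;   # do not expand pages at the depth limit
        for my $link (find_links_on($url)) {
            next if $seen{$link}++;
            push @todo, [$link, $depth + 1];
        }
    }
    return @visited;
}

my @visited = crawl_breadth_first('http://example.test/', 2);
print "$_->[1]  $_->[0]\n" for @visited;
```

Marking a URL as seen when it is queued, rather than when it is dequeued, keeps duplicates out of the to-do list, and because the queue is first in, first out, the first depth recorded for a URL is also its shortest distance from the start page.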

Regarding "perl - How can I make a Perl web crawler retrieve 'breadth first', like wget?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/61906213/
