perl - How can I make a Perl web crawler do 'breadth first' retrieval, like wget?


Reposted by: 行者123. Last updated: 2023-12-02 19:30:26


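I have written a basic web crawler in Perl. How can I make it more sophisticated by having it retrieve pages in a 'breadth first' manner, like wget does?

This is from the wget docs: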

Recursive retrieval of HTTP and HTML/CSS content is breadth-first. This means that Wget first downloads the requested document, then the documents linked from that document, then the documents linked by them, and so on. In other words, Wget first downloads the documents at depth 1, then those at depth 2, and so on until the specified maximum depth.

Any comments about my code would also be much appreciated.

use feature 'say';
use WWW::Mechanize;
use List::MoreUtils 'any';

##############
# parameters #
##############
my $url = "https://www.crawler-test.com/"; # starting url
my $depth_level = 2; # depth level for crawling (level 1 will only look at links on the starting page)
my $filter = ".pdf"; # for multiple types use this format: ".pdf|.docx|.doc|.rtf"
my $wait = 2; # wait this number of seconds between http requests (be kind to the server)
my $domain = ""; # only crawl links with host ending in this string, leave blank if not required. For multiple domains, use this format: "domain1|domain2"
my $nocrawlagain = 1; # setting this to 1 will mean that the same link will not be crawled again, set to 0 to turn this off
##############


$domain = quotemeta($domain);
$domain =~ s/\\\|/|/g;

my @alreadycrawled; # URLs already crawled (this is the name crawl() actually uses)

open(LOG, '>', 'mecherrors.log') or die "Cannot open mecherrors.log: $!";
LOG->autoflush;

my $mech = WWW::Mechanize->new(stack_depth => 0, onerror => \&mecherror);

sub crawl {

    my $url = shift;
    my $filter = shift;
    my $depth = shift || 1;

    return if $depth > $depth_level;

    say "Crawling $url";
    $mech->get($url);
    sleep $wait;
    return unless ($mech->success and $mech->is_html);


    my @linkstocrawl;

    for my $link ($mech->find_all_links(url_abs_regex => qr/^http/))  # only get http links (excludes things like mailto:)
    {

        next if $link->url =~ /#/;  # excludes URLs that are referring to an anchor

        # if the link matches the filter then download it
        if ($link->url =~ /($filter)$/)
        {
            my $urlfilename = ($link->URI->path_segments)[-1];
            next if -e $urlfilename;
            $mech->get($url); # go to base page
            sleep $wait;
            $mech->get($link->url);
            sleep $wait;
            my $filename = $mech->response->filename;
            next if -e $filename;
            $mech->save_content($filename);
            say "Saved $filename";

        } else {

            push @linkstocrawl, $link;

        }
    }

    for my $link (@linkstocrawl)
    {
        next unless $link->url_abs->host =~ /($domain)$/;
        if ($nocrawlagain)
        {
            # skip if already crawled this link
            next if any { $_ eq $link->url_abs->abs } @alreadycrawled;
            push @alreadycrawled, $link->url_abs->abs;
        }
        crawl($link->url_abs->abs, $filter, $depth + 1);
    }

}


crawl($url, $filter);

sub mecherror {
    print LOG "[", $mech->uri, "] ", $mech->response->message, "\n";
}

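Best answer

If you want to do breadth-first, you need to take the my @linkstocrawl declaration out of sub crawl, so that there is only one master to-do list rather than a separate list for each call to the crawl sub.

It will also be easier to do breadth-first if you make the code non-recursive, because recursion lends itself more or less automatically to depth-first. (When you recursively call a sub to handle a section of the search space, that sub won't return until that section is completely finished, which isn't what you want for breadth-first.)

So the general structure you want is something like this (not complete or tested code):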
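To make the ordering difference concrete, here is a small sketch that walks the same graph both ways. The %links hash is a made-up stand-in for real pages, and recursive_order and queue_order are hypothetical names, not part of the question's code.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real pages: page => links found on it.
my %links = (
    A => [qw(B C)],
    B => [qw(D)],
    C => [qw(E)],
    D => [],
    E => [],
);

# Recursive walk: everything reachable through B is finished before C is touched.
sub recursive_order {
    my ($page, $seen, $order) = @_;
    return if $seen->{$page}++;          # skip pages already visited
    push @$order, $page;
    recursive_order($_, $seen, $order) for @{ $links{$page} };
    return $order;
}

# Queue-based walk: take from the front of the list, add new links at the back.
sub queue_order {
    my ($start) = @_;
    my @queue = ($start);
    my (%seen, @order);
    while (defined(my $page = shift @queue)) {
        next if $seen{$page}++;          # skip pages already visited
        push @order, $page;
        push @queue, @{ $links{$page} };
    }
    return \@order;
}

print "recursive: @{ recursive_order('A', {}, []) }\n";   # A B D C E
print "queue:     @{ queue_order('A') }\n";               # A B C D E
```

The recursive version reaches D (depth 2) before C (depth 1), while the queue version finishes each depth before moving to the next.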

my @linkstocrawl = $starting_url;
my %linkscrawled; # hash instead of array for faster/easier lookups

while (my $url = shift @linkstocrawl) {
  next if exists $linkscrawled{$url}; # already saw it, so skip it
  $linkscrawled{$url}++;

  my $page = fetch($url);
  push @linkstocrawl, find_links_on($page);
  # you could also push the links onto @linkstocrawl one-by-one, depending on
  # whether you prefer to parse the page incrementally or grab them all at once

  # Do whatever else you want to do with $page
}
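The outline above has no depth limit, which the question's $depth_level parameter and wget's maximum depth both rely on. One possible way to add it is to store the depth next to each URL in the queue. This is a sketch only: bfs_crawl, $links_of and the %site data are assumptions made for illustration, and the start page is counted as depth 0 here.

```perl
use strict;
use warnings;

# Breadth-first crawl with a depth limit.
# $links_of is a caller-supplied code ref (hypothetical) that takes a URL
# and returns the list of URLs linked from it.
sub bfs_crawl {
    my ($start, $max_depth, $links_of) = @_;

    my @queue = ([$start, 0]);    # each entry is [url, depth]
    my %seen  = ($start => 1);    # URLs that have already been queued
    my @visited;                  # [url, depth] pairs in visit order

    while (my $item = shift @queue) {
        my ($url, $depth) = @$item;
        push @visited, [$url, $depth];

        next if $depth >= $max_depth;    # do not expand pages at the limit
        for my $next ($links_of->($url)) {
            next if $seen{$next}++;      # queue each URL only once
            push @queue, [$next, $depth + 1];
        }
    }
    return @visited;
}

# Stand-in for real pages so the sketch runs without network access.
my %site = (
    '/'  => ['/a', '/b'],
    '/a' => ['/c', '/'],
    '/b' => ['/c', '/d'],
    '/c' => ['/e'],
);

my @result = bfs_crawl('/', 2, sub { @{ $site{ $_[0] } || [] } });
print "depth $_->[1]: $_->[0]\n" for @result;
```

With a limit of 2, the pages come out as /, /a, /b, /c, /d and /e is never fetched. In a real crawler the callback could fetch the page with $mech->get($url) and return map { $_->url_abs->as_string } $mech->find_all_links(url_abs_regex => qr/^http/); that wiring is not tested here.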

Regarding "perl - How can I make a Perl web crawler do 'breadth first' retrieval, like wget?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/61906213/
