
perl - Recursive web crawler in Perl


I am trying to write a minimal web crawler. The goal is to discover new URLs from a seed page and then crawl those new URLs in turn. The code so far:

use strict;
use warnings;
use Carp;
use Data::Dumper;
use WWW::Mechanize;

my $url = "http://foobar.com"; # example
my %links;

my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
my @cr_fronteir = $mech->find_all_links();

foreach my $links (@cr_fronteir) {
    # Keep only absolute http(s) links
    if ( $links->[0] =~ m/^http/xms ) {
        $links{$links->[0]} = $links->[1];   # url => link text
    }
}

This is where I am stuck: how do I go on to crawl the links collected in %links, and how do I add a depth limit so the crawl does not run away? Suggestions are welcome.
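For reference, one way to continue with WWW::Mechanize is a breadth-first queue that stores each URL together with its depth. This is only a minimal sketch; the %seen hash, @queue layout, and $max_depth limit below are illustrative choices, not part of the original question:

use strict;
use warnings;
use WWW::Mechanize;

# Minimal sketch: breadth-first crawl with a hard depth limit.
my $max_depth = 3;                             # illustrative limit
my $mech      = WWW::Mechanize->new(autocheck => 0);

my %seen;                                      # URLs already fetched
my @queue = ([ 'http://foobar.com', 0 ]);      # [ url, depth ] pairs

while (my $item = shift @queue) {
    my ($url, $depth) = @$item;
    next if $depth > $max_depth or $seen{$url}++;

    $mech->get($url);
    next unless $mech->success and $mech->is_html;

    for my $link ($mech->find_all_links()) {
        my $abs = $link->url_abs->as_string;   # resolve relative URLs
        push @queue, [ $abs, $depth + 1 ]
            if $abs =~ m{^https?://}x and not $seen{$abs};
    }
}

Because the depth travels with each URL in the queue, the crawl stops expanding once every queued entry is past $max_depth.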

Best Answer

The Mojolicious web framework offers several features that are useful for web crawlers:

  • No dependencies beyond Perl v5.10 or newer
  • URL parser
  • DOM tree parser
  • Asynchronous HTTP/HTTPS client (allows concurrent requests without fork() overhead)

Here is an example that recursively crawls the local Apache documentation and prints each page title and the extracted links. It uses 4 parallel connections, does not descend more than 3 path levels, and visits each extracted link only once:

#!/usr/bin/env perl
use 5.010;
use open qw(:locale);
use strict;
use utf8;
use warnings qw(all);

use Mojo::UserAgent;

# FIFO queue
my @urls = (Mojo::URL->new('http://localhost/manual/'));

# User agent following up to 5 redirects
my $ua = Mojo::UserAgent->new(max_redirects => 5);

# Track accessed URLs
my %uniq;

my $active = 0;

sub parse {
    my ($tx) = @_;

    # Request URL
    my $url = $tx->req->url;

    say "\n$url";
    say $tx->res->dom->at('html title')->text;

    # Extract and enqueue URLs
    for my $e ($tx->res->dom('a[href]')->each) {

        # Validate href attribute
        my $link = Mojo::URL->new($e->{href});
        next if 'Mojo::URL' ne ref $link;

        # "normalize" link
        $link = $link->to_abs($tx->req->url)->fragment(undef);
        next unless $link->protocol =~ /^https?$/x;

        # Don't go deeper than /a/b/c
        next if @{$link->path->parts} > 3;

        # Access every link only once
        next if ++$uniq{$link->to_string} > 1;

        # Don't visit other hosts
        next if $link->host ne $url->host;

        push @urls, $link;
        say " -> $link";
    }

    return;
}

sub get_callback {
    my (undef, $tx) = @_;

    # Parse only OK HTML responses
    $tx->res->code == 200
        and $tx->res->headers->content_type =~ m{^text/html\b}ix
        and parse($tx);

    # Deactivate
    --$active;

    return;
}

Mojo::IOLoop->recurring(
    0 => sub {

        # Keep up to 4 parallel crawlers sharing the same user agent
        for ($active .. 4 - 1) {

            # Dequeue or halt if there are no active crawlers anymore
            return ($active or Mojo::IOLoop->stop)
                unless my $url = shift @urls;

            # Fetch non-blocking just by adding
            # a callback and marking as active
            ++$active;
            $ua->get($url => \&get_callback);
        }
    }
);

# Start event loop if necessary
Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

For more web-scraping tips and tricks, read the article I Don’t Need No Stinking API: Web Scraping For Fun and Profit.

This question about a recursive web crawler in Perl comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/13899872/
