gpt4 book ai didi

php - 编写一个打开网站页面并将页面内容存储在变量中的 PHP 脚本

转载 作者:行者123 更新时间:2023-12-04 05:07:51 27 4
gpt4 key购买 nike

我一直在构建一个搜索引擎,但现在我需要一个用 PHP 编写的网络爬虫,它可以爬取我的网站的内容。

我不知道网络爬虫/蜘蛛是否是正确的词,但我希望并想知道是否有人可以帮助我编写一个简单的 PHP 脚本,打开以 .php 或 .html 结尾的域中的所有页面并获取内容在页面中并将其作为原始文本存储在变量中。每页一个变量。

如果有人知道一个好的开源脚本可以做到这一点或可以帮助我编写一个脚本,请分享或这样做 - 我将不胜感激所有帮助。

最佳答案

退房 http://sourceforge.net/projects/php-crawler/

或者试试这个简单的代码来搜索 Google Analytics 跟踪代码的存在:

// Disable time limit to keep the script running
set_time_limit(0);
// Domain to start crawling
$domain = "http://webdevwonders.com";
// Content to search for existence
$content = "google-analytics.com/ga.js";
// Tag in which you look for the content
$content_tag = "script";
// Name of the output file
$output_file = "analytics_domains.txt";
// Maximum urls to check
$max_urls_to_check = 100;
$rounds = 0;
// Array to hold all domains to check
$domain_stack = array();
// Maximum size of domain stack
$max_size_domain_stack = 1000;
// Hash to hold all domains already checked
$checked_domains = array();

// Loop through the domains as long as domains are available in the stack
// and the maximum number of urls to check is not reached
while ($domain != "" && $rounds < $max_urls_to_check) {
$doc = new DOMDocument();

// Get the sourcecode of the domain
@$doc->loadHTMLFile($domain);
$found = false;

// Loop through each found tag of the specified type in the dom
// and search for the specified content
foreach($doc->getElementsByTagName($content_tag) as $tag) {
if (strpos($tag->nodeValue, $content)) {
$found = true;
break;
}
}

// Add the domain to the checked domains hash
$checked_domains[$domain] = $found;
// Loop through each "a"-tag in the dom
// and add its href domain to the domain stack if it is not an internal link
foreach($doc->getElementsByTagName('a') as $link) {
$href = $link->getAttribute('href');
if (strpos($href, 'http://') !== false && strpos($href, $domain) === false) {
$href_array = explode("/", $href);
// Keep the domain stack to the predefined max of domains
// and only push domains to the stack that have not been checked yet
if (count($domain_stack) < $max_size_domain_stack &&
$checked_domains["http://".$href_array[2]] === null) {
array_push($domain_stack, "http://".$href_array[2]);
}
};
}

// Remove all duplicate urls from stack
$domain_stack = array_unique($domain_stack);
$domain = $domain_stack[0];
// Remove the assigned domain from domain stack
unset($domain_stack[0]);
// Reorder the domain stack
$domain_stack = array_values($domain_stack);
$rounds++;
}

$found_domains = "";
// Add all domains where the specified search string
// has been found to the found domains string
foreach ($checked_domains as $key => $value) {
if ($value) {
$found_domains .= $key."\n";
}
}

// Write found domains string to specified output file
file_put_contents($output_file, $found_domains);

我找到了 here .

关于php - 编写一个打开网站页面并将页面内容存储在变量中的 PHP 脚本,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/15283341/

27 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com