
php - How do I detect crawlers/spiders with PHP?

Reposted. Author: 可可西里. Updated: 2023-11-01 12:49:06

How do I detect crawlers/spiders with PHP?

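I'm currently working on a project where I need to track every crawler's visits.
I know I should use HTTP_USER_AGENT, but I'm not sure how to structure the code for this. I also know the user agent can easily be changed, so I'd like to know whether I can add further checks to guard against spoofing.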

Sample code for what I'm trying to do:

<?php
// The User-Agent header may be missing, so fall back to an empty string
$user_agent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (strpos($user_agent, 'Google') !== false) {
    echo "Googlebot is here";
}
?>

Thanks

Best answer

According to Verifying Googlebot:

You can verify that a bot accessing your server really is Googlebot (or another Google user-agent) by using a reverse DNS lookup, verifying that the name is in the googlebot.com domain, and then doing a forward DNS lookup using that googlebot name. This is useful if you're concerned that spammers or other troublemakers are accessing your site while claiming to be Googlebot.

For example:

host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard coded them. The best way to identify accesses by Googlebot is to use the user-agent (Googlebot).

You can do the reverse DNS lookup in PHP:

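function validateGoogleBotIP($ip) {
    // Reverse lookup; on failure gethostbyaddr() returns the IP unchanged (or false if malformed)
    $hostname = gethostbyaddr($ip); // e.g. "crawl-66-249-66-1.googlebot.com"
    if (!is_string($hostname) || !preg_match('/\.google(bot)?\.com$/i', $hostname)) {
        return false;
    }
    // Forward lookup: the name must resolve back to the same address (IPv4 only)
    return in_array($ip, gethostbynamel($hostname) ?: [], true);
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? ''; // the header may be absent
if (strpos($userAgent, 'Google') !== false) {
    if (validateGoogleBotIP($_SERVER['REMOTE_ADDR'])) {
        echo 'It is ACTUALLY google';
    } else {
        echo 'Someone\'s faking it!';
    }
} else {
    echo 'Nothing to do with Google';
}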
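
Since the question asks about tracking more than one crawler, the procedure in the quoted guidance (reverse lookup, domain check, forward lookup) could be wrapped in a reusable helper along the following lines. This is only a sketch, assuming PHP 7.1 or later and IPv4 addresses. The names CRAWLER_SUFFIXES, verifyCrawlerIp and classifyVisitor are hypothetical and do not come from the original answer. The two resolver parameters are an illustrative design choice so the logic can be exercised with stubs instead of live DNS. Only the Google suffixes are listed, because they are the only ones the source mentions; any other crawler's suffixes would have to come from that operator's own documentation.

```php
<?php
// Hypothetical table: user-agent token => hostname suffixes accepted for it.
// Only Google is listed because it is the only crawler the source covers.
const CRAWLER_SUFFIXES = [
    'Googlebot' => ['.googlebot.com', '.google.com'],
];

/**
 * Forward-confirmed reverse DNS: PTR lookup, suffix check, then A lookup.
 * $reverse and $forward default to PHP's resolvers and can be replaced by stubs.
 */
function verifyCrawlerIp(string $ip, array $suffixes, ?callable $reverse = null, ?callable $forward = null): bool
{
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbynamel';

    $host = $reverse($ip);
    if (!is_string($host) || $host === $ip) {
        return false; // malformed address, or no PTR record found
    }
    $host = strtolower(rtrim($host, '.'));

    // The hostname must end with one of the accepted suffixes
    $allowed = false;
    foreach ($suffixes as $suffix) {
        if (substr($host, -strlen($suffix)) === $suffix) {
            $allowed = true;
            break;
        }
    }
    if (!$allowed) {
        return false;
    }

    // The hostname must resolve back to the address that connected
    $ips = $forward($host);
    return is_array($ips) && in_array($ip, $ips, true);
}

/**
 * Returns the claimed crawler and whether DNS confirms the claim,
 * or null when the user agent does not name a crawler from the table.
 */
function classifyVisitor(string $userAgent, string $ip, ?callable $reverse = null, ?callable $forward = null): ?array
{
    foreach (CRAWLER_SUFFIXES as $token => $suffixes) {
        if (stripos($userAgent, $token) !== false) {
            return [
                'crawler'  => $token,
                'verified' => verifyCrawlerIp($ip, $suffixes, $reverse, $forward),
            ];
        }
    }
    return null;
}
```

In a request handler this might be called as classifyVisitor($_SERVER['HTTP_USER_AGENT'] ?? '', $_SERVER['REMOTE_ADDR'] ?? ''), with the returned array written to whatever visit log the project uses. DNS lookups on every request can be slow, so caching the result per IP address is probably worth considering.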

Regarding php - How do I detect crawlers/spiders with PHP?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19980363/
