
webserver - What is the optimal duration for a web crawler to wait between repeated requests to a web server?

Reposted. Author: 行者123. Updated: 2023-12-04 15:28:28

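Is there a standard duration that a crawler must wait between repeated hits to the same server, so that it does not overload the server?

If not, any suggestions on what waiting period would be considered polite for a crawler are welcome.

Does this value also vary from server to server, and if so, how can it be determined?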

Best answer

IBM's documentation goes into some detail in two articles: "How the Web crawler uses the robots exclusion protocol" and "Recrawl interval settings in the Web crawler".
To quote the latter:

The first time that a page is crawled, the crawler uses the date and time that the page is crawled and an average of the specified minimum and maximum recrawl intervals to set a recrawl date. The page will not be recrawled before that date. The time that the page will be recrawled after that date depends on the crawler load and the balance of new and old URLs in the crawl space.

Each time that the page is recrawled, the crawler checks to see if the content has changed. If the content has changed, the next recrawl interval will be shorter than the previous one, but never shorter than the specified minimum recrawl interval. If the content has not changed, the next recrawl interval will be longer than the previous one, but never longer than the specified maximum recrawl interval.
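
The behaviour quoted above can be sketched roughly as follows. This is a minimal illustration and not IBM's implementation: the class name `RecrawlPolicy`, its method names, and the halving/doubling `factor` are all assumptions, since the quote only says the interval becomes "shorter" or "longer" within the configured bounds.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecrawlPolicy:
    """Hypothetical adaptive recrawl policy modelled on the quoted description."""
    min_interval: timedelta
    max_interval: timedelta
    factor: float = 2.0  # assumption: the quote does not say how much to shrink or grow

    def initial_interval(self) -> timedelta:
        # First crawl: use the average of the minimum and maximum intervals.
        return (self.min_interval + self.max_interval) / 2

    def first_recrawl_date(self, crawled_at: datetime) -> datetime:
        # The page is not recrawled before this date.
        return crawled_at + self.initial_interval()

    def next_interval(self, previous: timedelta, content_changed: bool) -> timedelta:
        # Changed content -> shorter interval; unchanged content -> longer interval.
        if content_changed:
            candidate = previous / self.factor
        else:
            candidate = previous * self.factor
        # Never go below the minimum or above the maximum.
        return max(self.min_interval, min(self.max_interval, candidate))


policy = RecrawlPolicy(min_interval=timedelta(hours=1), max_interval=timedelta(hours=9))
interval = policy.initial_interval()
shorter = policy.next_interval(interval, content_changed=True)
longer = policy.next_interval(interval, content_changed=False)
```

With these example bounds, a page that keeps changing drifts toward the one-hour minimum, while a page that never changes settles at the nine-hour maximum.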


This is about their web crawler, but it is very useful reading when building your own tool.
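
On the per-server part of the question, one possible approach when building your own tool is to read a `Crawl-delay` value from each site's robots.txt where one is present. The sketch below uses Python's standard `urllib.robotparser`. `Crawl-delay` is a non-standard extension that many sites do not set. The helper name `politeness_delay` and the 1-second fallback are assumptions made for illustration, not values taken from the article.

```python
from urllib.robotparser import RobotFileParser

# Assumption: placeholder fallback used when robots.txt gives no Crawl-delay.
DEFAULT_DELAY_SECONDS = 1.0


def politeness_delay(robots_lines, user_agent, default=DEFAULT_DELAY_SECONDS):
    """Return the delay in seconds to wait between requests to one host.

    robots_lines: the lines of that host's robots.txt file.
    user_agent: the crawler's user-agent token (hypothetical in this example).
    """
    parser = RobotFileParser()
    parser.parse(robots_lines)
    delay = parser.crawl_delay(user_agent)  # None if no Crawl-delay applies
    return float(delay) if delay is not None else default


with_delay = politeness_delay(["User-agent: *", "Crawl-delay: 5"], "examplebot")
without_delay = politeness_delay(["User-agent: *", "Disallow: /private"], "examplebot")
```

In a real crawler the lines would come from fetching `/robots.txt` on each host, and the resulting delay would be applied per host rather than globally.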

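Regarding "webserver - What is the optimal duration for a web crawler to wait between repeated requests to a web server?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/799239/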
