
php - Web Crawler - getting data from 2000+ web pages (TED website example)

Reposted. Author: 行者123. Updated: 2023-11-27 22:34:23

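I am writing a PHP cron job script that will run once a week.

The main purpose of this script is to fetch the details of all the TED talks available on the TED website (used here as an example to make the question easier to understand).

The script takes about 70 minutes to run, and it goes over 2000+ web pages.

My questions are:

1) Is there a better/faster way to fetch each web page? I am currently using this function:

file_get_contents_curl($url)

2) Is it good practice to hold all the talks in one array (which can get quite big)?

3) Is there a better way to get the details of all TED talks from the website? What is the best way to "crawl" the TED website for all the talks?

**I checked the option of using the RSS feed, but it lacks some of the details I need.

Thanks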

<?php
define("START_ID", 1);
define("STOP_TED_QUERY",20);
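// Despite the constant's name, the script treats a page whose title is exactly
// this string as a missing talk (see the check inside the loop below).
define("VALID_PAGE", "TED | Talks");
/**
 * This script runs as a cron job and goes over all the pages
 * at http://www.ted.com/talks/view/id/
 * from id 1 until there are no more pages.
 *
 * The Talk class (setID, setName, setTags) is not shown in the question
 * and is assumed to be defined elsewhere.
 */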

/**
 * Fetch the content of a URL using cURL.
 * @param string $url the URL to fetch
 * @return string|false the response body, or false on failure
 * @author XXXXX
 */
function file_get_contents_curl($url)
{
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_HEADER, 0);          // leave response headers out of the output
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);  // return the body instead of printing it
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);  // follow redirects

    $data = curl_exec($ch);
    curl_close($ch);

    return $data;
}

//will hold all talks in array
$tedTalks = array();

//id to start the query from
$id=START_ID;

//counts consecutive missing pages; used to stop once we have passed the last id on the TED website
$endOFQuery=0;

//get the time
$time_start = microtime(true);

//start querying the TED website
//if 20 pages in a row do not exist, stop querying and assume there are no more talks
while ($endOFQuery < STOP_TED_QUERY) {

    //get the page of the talk
    $html = file_get_contents_curl("http://www.ted.com/talks/view/id/$id");

    //parse the page; a failed or empty download is treated like a missing page
    //(DOMDocument::loadHTML() rejects an empty string)
    $title = VALID_PAGE;
    if (is_string($html) && $html !== '') {
        $doc = new DOMDocument();
        @$doc->loadHTML($html);
        $titleNode = $doc->getElementsByTagName('title')->item(0);
        if ($titleNode !== null) {
            $title = $titleNode->nodeValue;
        }
    }

    //check whether this is a valid page
    if ($title === VALID_PAGE) {
        //removed talk or past the last id: count it (enough of these in a row stops the loop)
        $endOFQuery++;
    } else {
        //this is a valid TED talk, get its details

        //reset the counter of consecutive missing pages
        $endOFQuery = 0;

        //find the keywords meta tag; start empty so a previous talk's value is never reused
        $keywords = '';
        foreach ($doc->getElementsByTagName('meta') as $meta) {
            if ($meta->getAttribute('name') == 'keywords') {
                $keywords = $meta->getAttribute('content');
            }
        }

        //create a new talk object and populate it
        $talk = new Talk();
        //set its TED id from the TED website
        $talk->setID($id);
        //parse the name (the title has extra text after the '|')
        $talk->setName(substr($title, 0, strpos($title, '|')));

        //parse the string of tags into an array
        $keywords = explode(",", $keywords);
        //remove unneeded items from it
        $keywords = array_diff($keywords, array("TED", "Talks"));

        //add the filtered tags to the talk
        $talk->setTags($keywords);

        //add to the array of all talks
        $tedTalks[] = $talk;
    }

    //move on to the next TED talk id
    $id++;
} //end of the while

$time_end = microtime(true);
$execution_time = ($time_end - $time_start);
echo "this took (sec) : ".$execution_time;

?>
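
On question 1: the script above downloads one page at a time, so much of the run time is probably spent waiting on the network. One common way to reduce that wait is to keep several transfers in flight with PHP's curl_multi_* functions. The following is only a minimal sketch of that idea, not the asker's code: fetch_many and its $window parameter are hypothetical names, the 30-second timeout is an arbitrary choice, and any response outside the 2xx range is simply treated as a failure.

```php
<?php
/**
 * Fetch several URLs concurrently, at most $window at a time.
 * Returns an array mapping each URL to its body, or to null on failure.
 * (Hypothetical helper for illustration; not part of the original script.)
 */
function fetch_many(array $urls, int $window = 10): array
{
    $results = [];
    $urls = array_values(array_unique($urls));

    foreach (array_chunk($urls, max(1, $window)) as $batch) {
        $mh = curl_multi_init();
        $handles = [];

        foreach ($batch as $url) {
            $ch = curl_init($url);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($ch, CURLOPT_TIMEOUT, 30);
            curl_multi_add_handle($mh, $ch);
            $handles[$url] = $ch;
        }

        // Drive every transfer in this batch until none is still running.
        do {
            $status = curl_multi_exec($mh, $running);
            if ($running > 0 && curl_multi_select($mh, 1.0) === -1) {
                usleep(1000); // select failed; back off briefly instead of spinning
            }
        } while ($running > 0 && $status === CURLM_OK);

        foreach ($handles as $url => $ch) {
            $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
            $results[$url] = ($code >= 200 && $code < 300)
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
        }
        curl_multi_close($mh);
    }

    return $results;
}
```

A batch of ids could then be fetched with something like fetch_many(array_map(function ($id) { return "http://www.ted.com/talks/view/id/$id"; }, range(1, 50))). A small window is a deliberate choice here: it keeps the load on the remote site modest while still overlapping the network waits.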

Best answer

There is a PHP web crawler example on GitHub,

in case anyone is looking for how this works:

https://github.com/Nimrod007/TED-talks-details-from-TED.com-and-youtube

I published a freemium API on Mashape that implements this script: https://market.mashape.com/bestapi/ted

Enjoy!
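
On question 2: one alternative to holding every talk in a single array is to produce talks one at a time with a generator and write each one out as soon as it is available. The sketch below illustrates that approach under stated assumptions and is not taken from the linked repository: iterate_talks, save_talks and the $fetchTalk callable are hypothetical names, and the JSON Lines output format is just one convenient choice. $fetchTalk stands in for "download and parse talk $id", returning an array for a valid talk or null when no talk exists.

```php
<?php
/**
 * Yield one talk at a time instead of collecting all of them in an array.
 * Stops after $maxMisses consecutive ids for which $fetchTalk returns null,
 * mirroring the stop rule of the script in the question.
 */
function iterate_talks(callable $fetchTalk, int $startId = 1, int $maxMisses = 20): Generator
{
    $misses = 0;
    for ($id = $startId; $misses < $maxMisses; $id++) {
        $talk = $fetchTalk($id);
        if ($talk === null) {
            $misses++;      // missing page: count it and try the next id
            continue;
        }
        $misses = 0;        // a valid talk resets the counter
        yield $id => $talk;
    }
}

/**
 * Write each talk to $path as one JSON object per line.
 * Returns the number of talks written.
 */
function save_talks(iterable $talks, string $path): int
{
    $out = fopen($path, 'w');
    if ($out === false) {
        throw new RuntimeException("Cannot open $path for writing");
    }
    $count = 0;
    foreach ($talks as $talk) {
        fwrite($out, json_encode($talk) . "\n");
        $count++;
    }
    fclose($out);
    return $count;
}
```

With this shape only one talk is held in memory at a time, so memory use no longer grows with the number of talks. Whether that matters for about 2000 small objects is a judgment call; a plain array of that size is usually manageable as well.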

Regarding php - Web Crawler - getting data from 2000+ web pages (TED website example), a similar question was found on Stack Overflow: https://stackoverflow.com/questions/14925949/
