php - Multi-threaded cURL with SSL and redirects

Reposted by: 太空宇宙 | Updated: 2023-11-03 13:25:26

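I have a very simple scraper that does what I need, but it is slow: it fetches 2 images in 3 seconds, and I need it to fetch at least 1000 images within a few seconds.

This is the code I am using now: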

<?php
require_once('config.php');

//Calling PHasher class file.
include_once('classes/phasher.class.php');
$I = PHasher::Instance();

//Prevent execution timeout.
set_time_limit(0);

//Solving SSL Problem.
$arrContextOptions=array(
    "ssl"=>array(
        "verify_peer"=>false,
        "verify_peer_name"=>false,
    ),
);

//Check if the database contains hashed pictures or if it's empty, Then start from the latest hashed picture or start from 4.
$check = mysqli_query($con, "SELECT fid FROM images ORDER BY fid DESC LIMIT 1;");
if(mysqli_num_rows($check) > 0){

    $max_fid = mysqli_fetch_row($check);

    $fid = $max_fid[0]+1;
} else {
    $fid = 4;
}

$deletedProfile = "https://z-1-static.xx.fbcdn.net/rsrc.php/v2/yo/r/UlIqmHJn-SK.gif";

//Infinite while loop to fetch profile pictures and save them inside the avatar folder.
$initial = $fid;

while($fid = $initial){

    $url = 'https://graph.facebook.com/'.$fid.'/picture?width=378&height=378';

    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
    curl_setopt($ch, CURLOPT_HEADER, false); // no needs to pass the headers to the data stream
    curl_setopt($ch, CURLOPT_NOBODY, true); // get the resource without a body
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
    curl_exec($ch);

    // get the last used URL
    $lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

    curl_close($ch);

    if($lastUrl == $deletedProfile){
        $initial++;
    }else{
        $imageUrl = file_get_contents($url, false, stream_context_create($arrContextOptions));
        $savedImage = dirname(__file__).'/avatar/image.jpg';
        file_put_contents($savedImage, $imageUrl);

        //Exclude deleted profiles or corrupted pictures.
        if(getimagesize($savedImage) > 0 ){

            //PHasher class call to hash the images to hexadecimal values or binary values.
            $hash = $I->FastHashImage($savedImage);
            $hex = $I->HashAsString($hash);

            //Store Facebook id and hashed values for the images in hexa values.
            mysqli_query($con, "INSERT INTO images(fid, hash) VALUES ('$fid', '$hex')");

            $initial++;
        } else {
            $initial++;
        }
    }
}

?>

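I have not figured out how to do it, but this is what I have in mind so far:

1- Split the loop into batches of 1000 profiles and store each batch in an array.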

$items = array();
for($i=$fid; $i <= $fid+1000; $i++){

    $url = 'https://graph.facebook.com/'.$i.'/picture?width=378&height=378';
    $items[$i] = array($url);
}

But the result is not right, and I would like to know how to fix the array output.

Array ( [28990] => Array ( [0] => https://graph.facebook.com/28990/picture?width=378&height=378 )
[28991] => Array ( [0] => https://graph.facebook.com/28991/picture?width=378&height=378 )
[28992] => Array ( [0] => https://graph.facebook.com/28992/picture?width=378&height=378 )
[28993] => Array ( [0] => https://graph.facebook.com/28993/picture?width=378&height=378 )
[28994] => Array ( [0] => https://graph.facebook.com/28994/picture?width=378&height=378 )
[28995] => Array ( [0] => https://graph.facebook.com/28995/picture?width=378&height=378 )
[28996] => Array ( [0] => https://graph.facebook.com/28996/picture?width=378&height=378 )
[28997] => Array ( [0] => https://graph.facebook.com/28997/picture?width=378&height=378 )

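2- Then I want to use the output array with multi cURL, which allows multiple cURL handles to be processed asynchronously.

3- Check whether the final URL equals the deleted-profile URL; if it does not, pass the image to PHasher to convert it to a hash value and store it in the database.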
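
On step 1, the nested output appears because each URL is wrapped in its own `array($url)`; storing the string directly gives a flat `id => URL` map. A minimal sketch, assuming a helper named `buildPictureUrls()` (a hypothetical name, not from the original code). It also uses `<` rather than `<=`, so a batch holds exactly the requested number of entries instead of 1001:

```php
<?php
// Hypothetical helper: build a flat "Facebook id => picture URL" map for one batch.
function buildPictureUrls(int $startFid, int $count): array
{
    $urls = [];
    for ($i = $startFid; $i < $startFid + $count; $i++) {
        // Store the URL string itself, not array($url), to avoid nesting.
        $urls[$i] = 'https://graph.facebook.com/' . $i . '/picture?width=378&height=378';
    }
    return $urls;
}

$items = buildPictureUrls(28990, 3);
print_r($items);
```

Keeping the id as the array key means the id is still available later, when the finished download has to be matched back to its profile.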

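Best answer

I happen to have just what you need, although I have not reached that kind of throughput (1000 parallel requests per second).

I forget where I originally got this, but I am using it to download reddit content: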

class ParallelCurl {

    public $max_requests;
    public $options;
    public $outstanding_requests;
    public $multi_handle;

    public function __construct($in_max_requests = 10, $in_options = array()) {
        $this->max_requests = $in_max_requests;
        $this->options = $in_options;

        $this->outstanding_requests = array();
        $this->multi_handle = curl_multi_init();
    }

    //Ensure all the requests finish nicely
    public function __destruct() {
        $this->finishAllRequests();
    }

    // Sets how many requests can be outstanding at once before we block and wait for one to
    // finish before starting the next one
    public function setMaxRequests($in_max_requests) {
        $this->max_requests = $in_max_requests;
    }

    // Sets the options to pass to curl, using the format of curl_setopt_array()
    public function setOptions($in_options) {
        $this->options = $in_options;
    }

    // Start a fetch from the $url address, calling the $callback function passing the optional
    // $user_data value. The callback is called with 4 arguments: the response content, the url,
    // the curl handle and the user data, eg on_request_done($content, $url, $ch, $user_data);
    public function startRequest($url, $callback, $user_data = array(), $post_fields = null, $headers = null) {
        if ($this->max_requests > 0)
            $this->waitForOutstandingRequestsToDropBelow($this->max_requests);

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt_array($ch, $this->options);
        curl_setopt($ch, CURLOPT_URL, $url);
        if (isset($post_fields)) {
            curl_setopt($ch, CURLOPT_POST, TRUE);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $post_fields);
        }
        if (is_array($headers)) {
            curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
        }

        curl_multi_add_handle($this->multi_handle, $ch);

        // On PHP 8+ curl handles are CurlHandle objects and cannot be cast to int,
        // so key by object id there and fall back to the resource id on older versions
        $ch_array_key = is_object($ch) ? spl_object_id($ch) : (int) $ch;
        $this->outstanding_requests[$ch_array_key] = array(
            'link_url' => $url,
            'callback' => $callback,
            'user_data' => $user_data,
        );

        $this->checkForCompletedRequests();
    }

    // You *MUST* call this function at the end of your script. It waits for any running requests
    // to complete, and calls their callback functions
    public function finishAllRequests() {
        $this->waitForOutstandingRequestsToDropBelow(1);
    }

    // Checks to see if any of the outstanding requests have finished
    private function checkForCompletedRequests() {
        /*
          // Call select to see if anything is waiting for us
          if (curl_multi_select($this->multi_handle, 0.0) === -1)
          return;

          // Since something's waiting, give curl a chance to process it
          do {
          $mrc = curl_multi_exec($this->multi_handle, $active);
          } while ($mrc == CURLM_CALL_MULTI_PERFORM);
         */
        // fix for https://bugs.php.net/bug.php?id=63411
        do {
            $mrc = curl_multi_exec($this->multi_handle, $active);
        } while ($mrc == CURLM_CALL_MULTI_PERFORM);
        while ($active && $mrc == CURLM_OK) {
            if (curl_multi_select($this->multi_handle) != -1) {
                do {
                    $mrc = curl_multi_exec($this->multi_handle, $active);
                } while ($mrc == CURLM_CALL_MULTI_PERFORM);
            } else
                return;
        }

        // Now grab the information about the completed requests
        while ($info = curl_multi_info_read($this->multi_handle)) {

            $ch = $info['handle'];
            // Must match the key computed in startRequest()
            $ch_array_key = is_object($ch) ? spl_object_id($ch) : (int) $ch;

            if (!isset($this->outstanding_requests[$ch_array_key])) {
                die("Error - handle wasn't found in requests: '$ch_array_key' in " .
                    print_r($this->outstanding_requests, true));
            }

            $request = $this->outstanding_requests[$ch_array_key];
            $url = $request['link_url'];
            $content = curl_multi_getcontent($ch);
            $callback = $request['callback'];
            $user_data = $request['user_data'];

            call_user_func($callback, $content, $url, $ch, $user_data);

            unset($this->outstanding_requests[$ch_array_key]);

            curl_multi_remove_handle($this->multi_handle, $ch);
        }
    }

    // Blocks until there's less than the specified number of requests outstanding
    private function waitForOutstandingRequestsToDropBelow($max) {
        while (1) {
            $this->checkForCompletedRequests();
            if (count($this->outstanding_requests) < $max)
                break;

            usleep(10000);
        }
    }

}

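It works like this: you pass a URL and a callback function (which can be anonymous) to ParallelCurl::startRequest(), which queues the download for that URL and calls the function when the download completes.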

$pcurl = new ParallelCurl(10, array(
    CURLOPT_RETURNTRANSFER  => 1,
    CURLOPT_FOLLOWLOCATION  => 1,
    CURLOPT_SSL_VERIFYPEER  => 1,
));

$pcurl->startRequest($url, function($data) {
     // download finished. $data is html or binary, whatever you requested
     echo $data;
});
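
To connect this to the three steps in the question, here is a minimal self-contained sketch that uses PHP's `curl_multi_*` functions directly rather than the `ParallelCurl` class above. The names `classifyPicture()` and `fetchAll()`, the 30-second timeout, and the window size are illustrative assumptions, not part of the original answer. Each request downloads the picture once, and the callback receives the effective URL after redirects, so the separate `CURLOPT_NOBODY` probe from the question would not be needed:

```php
<?php
// Hypothetical helper: decide what to do with one finished download.
function classifyPicture(string $effectiveUrl, string $body, string $deletedProfileUrl): string
{
    if ($effectiveUrl === $deletedProfileUrl) {
        return 'deleted';                        // redirected to the placeholder image
    }
    if ($body === '' || @getimagesizefromstring($body) === false) {
        return 'corrupt';                        // empty or not a readable image
    }
    return 'ok';
}

// Hypothetical helper: fetch "id => URL" pairs with at most $window transfers in flight.
// $onDone is called as $onDone(string $id, string $effectiveUrl, string $body).
function fetchAll(array $urls, int $window, callable $onDone): void
{
    $mh = curl_multi_init();
    $pending = $urls;                            // requests not started yet
    $inFlight = 0;                               // handles currently attached to $mh

    $startNext = function () use (&$pending, &$inFlight, $mh): void {
        if (!$pending) {
            return;
        }
        reset($pending);
        $id = key($pending);
        $url = $pending[$id];
        unset($pending[$id]);

        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,      // follow redirects to the final image
            CURLOPT_SSL_VERIFYPEER => true,      // keep certificate checks enabled
            CURLOPT_TIMEOUT        => 30,        // assumed value
            CURLOPT_PRIVATE        => (string) $id, // remember which id this handle belongs to
        ]);
        curl_multi_add_handle($mh, $ch);
        $inFlight++;
    };

    for ($i = 0; $i < $window; $i++) {
        $startNext();
    }

    do {
        $status = curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);                        // avoid a busy loop if select fails
        }
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $onDone(
                (string) curl_getinfo($ch, CURLINFO_PRIVATE),
                (string) curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
                (string) curl_multi_getcontent($ch)
            );
            curl_multi_remove_handle($mh, $ch);
            $inFlight--;
            $startNext();                        // refill the window
        }
    } while ($inFlight > 0 && $status === CURLM_OK);

    curl_multi_close($mh);
}
```

A callback for the question's case could call `classifyPicture()` with `$deletedProfile` and, for `'ok'` results, save the body, hash it with PHasher, and insert the row. Callbacks run one at a time in the same PHP process, so slow hashing delays refilling the window; whether 1000 images in a few seconds is reachable depends on the network and the remote server, not only on this loop.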

Regarding php - multi-threaded cURL with SSL and redirects, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33563920/
