c# - 从 Pinterest 网址获取图板中的所有图像-6ren

c# - 从 Pinterest 网址获取图板中的所有图像

转载作者：太空狗更新时间：2023-10-29 14:18:08

这道题听上去很简单，其实并不像听起来那么简单。

问题的简要总结

例如，使用这个板； http://pinterest.com/dodo/web-designui-and-mobile/

检查页面顶部板本身的 HTML(在 div 中，类 GridItems)产生:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 0px; left: 0px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 3343px; left: 1000px; visibility: visible;">..</div>
</div>

然而在页面底部，在激活无限滚动几次之后，我们得到了 HTML:

<div class="variableHeightLayout padItems GridItems Module centeredWithinWrapper" style="..">
    <!-- First div with a displayed board image -->
    <div class="item" style="top: 12431px; left: 750px; visibility: visible;">..</div>
    ...
    <!-- Last div with a displayed board image -->
    <div class="item" style="top: 19944px; left: 750px; visibility: visible;">..</div>
</div>

如您所见，页面上方的一些图像容器已经消失，并且并非所有图像容器都在首次加载页面时加载。

我想做什么

我希望能够创建一个 C# 脚本(或目前任何服务器端语言)来下载页面的完整 HTML(即检索页面上的每个图像)，然后图像将从他们的网站下载网址。下载网页并使用适当的 XPath 很容易，但真正的挑战是为每个图像下载完整的 HTML。

有没有一种方法可以模拟滚动到页面底部，或者是否有一种更简单的方法可以检索每张图片？我想象 Pinterest 使用 AJAX 来更改 HTML，有没有办法以编程方式触发事件以接收所有 HTML？预先感谢您提供建议和解决方案，如果您没有任何建议和解决方案，甚至阅读这个非常长的问题，我将感到荣幸!

伪代码

using System;
using System.Net;
using HtmlAgilityPack;

private void Main() {
    string pinterestURL = "http://www.pinterest.com/...";
    string XPath = ".../img";

    HtmlDocument doc = new HtmlDocument();

    // Currently only downloads the first 25 images.
    doc.Load(strPinterestUrl);

    foreach(HtmlNode link in doc.DocumentElement.SelectNodes(strXPath))
    {
         image_links[] = link["src"];
         // Use image links
    }
}

最佳答案

好的，所以我认为这可能是(稍作改动)您所需要的。

注意事项:

这是 PHP，不是 C#(但您说您对任何服务器端语言都感兴趣)。
此代码连接到(非官方)Pinterest 搜索端点。您需要更改 $data 和 $search_res 以反射(reflect)适合您任务的端点(例如 BoardFeedResouce)。注意:至少对于搜索，Pinterest 目前使用两个端点，一个用于初始页面加载，另一个用于无限滚动操作。每个都有自己的预期参数结构。
Pinterest 没有官方的公共(public) API，只要他们更改任何内容，就会在没有警告的情况下中断。
您可能会发现 pinterestapi.co.uk 更易于实现并且可以接受您正在做的事情。
我在该类下面有一些演示/调试代码，一旦您获得了您想要的数据，这些代码就不应该存在了，还有一个您可能想要更改的默认页面获取限制。

兴趣点:

下划线 _ 参数采用 JavaScript 格式的时间戳，即。像 Unix 时间，但它增加了毫秒。它实际上并不用于分页。
分页使用 bookmarks 属性，因此您向不需要它的"new"端点发出第一个请求，然后从结果中获取 bookmarks并在您的请求中使用它来获取下一个“页面”结果，从这些结果中获取 bookmarks 以获取之后的下一页，依此类推，直到您用完结果或达到您的预-设置限制(或者你达到脚本执行时间的服务器最大值)。我很想知道 bookmarks 字段编码的确切内容。我想除了 pin ID 或其他一些页面标记之外，还有一些有趣的秘方。
我正在跳过 html，而是处理 JSON，因为(对我而言)它比使用 DOM 操作解决方案或一堆正则表达式更容易。

<?php

if(!class_exists('Skrivener_Pins')) {

  class Skrivener_Pins {

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Pinterest search function. Uses Pinterest's "internal" page APIs, so likely to break if they change.
     * @author [@skrivener] Philip Tillsley
     * @param $search_str     The string used to search for matching pins.
     * @param $limit          Max number of pages to get, defaults to 2 to avoid excessively large queries. Use care when passing in a value.
     * @param $bookmarks_str  Used internally for recursive fetches.
     * @param $pages          Used internally to limit recursion.
     * @return array()        int['id'], obj['image'], str['pin_link'], str['orig_link'], bool['video_flag']
     * 
     * TODO:
        * 
        * 
     */
    public function get_tagged_pins($search_str, $limit = 1, $bookmarks_str = null, $page = 1) {

      // limit depth of recursion, ie. number of pages of 25 returned, otherwise we can hang on huge queries
      if( $page > $limit ) return false;

      // are we getting a next page of pins or not
      $next_page = false;
      if( isset($bookmarks_str) ) $next_page = true;

      // build url components
      if( !$next_page ) {

        // 1st time
        $search_res = 'BaseSearchResource'; // end point
        $path = '&module_path=' . urlencode('SearchInfoBar(query=' . $search_str . ', scope=boards)');
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "scope":"pins",
            "show_scope_selector":true,
            "query":"' . $search_str . '"
          },
          "context":{
            "app_version":"2f83a7e"
          },
          "module":{
            "name":"SearchPage",
            "options":{
              "scope":"pins",
              "query":"' . $search_str . '"
            }
          },
          "append":false,
          "error_strategy":0
          }');
      } else {

        // this is a fetch for 'scrolling', what changes is the bookmarks reference, 
        // so pass the previous bookmarks value to this function and it is included
        // in query
        $search_res = 'SearchResource'; // different end point from 1st time search
        $path = '';
        $data = preg_replace("'[\n\r\s\t]'","",'{
          "options":{
            "query":"' . $search_str . '",
            "bookmarks":["' . $bookmarks_str . '"],
            "show_scope_selector":null,
            "scope":"pins"
          },
          "context":{
            "app_version":"2f83a7e"
          },
            "module":{
              "name":"GridItems",
            "options":{
              "scrollable":true,
              "show_grid_footer":true,
              "centered":true,
              "reflow_all":true,
              "virtualize":true,
              "item_options":{
                "show_pinner":true,
                "show_pinned_from":false,
                "show_board":true
              },
              "layout":"variable_height"
            }
          },
          "append":true,
          "error_strategy":2
        }');
      }
      $data = urlencode($data);
      $timestamp = time() * 1000; // unix time but in JS format (ie. has ms vs normal server time in secs), * 1000 to add ms (ie. 0ms)

      // build url
      $url = 'http://pinterest.com/resource/' . $search_res . '/get/?source_url=/search/pins/?q=' . $search_str
          . '&data=' . $data
          . $path
          . '&_=' . $timestamp;//'1378150472669';

      // setup curl
      $ch = curl_init();
      curl_setopt($ch, CURLOPT_URL, $url);
      curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
      curl_setopt($ch, CURLOPT_HTTPHEADER, array("X-Requested-With: XMLHttpRequest"));

      // get result
      $curl_result = curl_exec ($ch); // this echoes the output
      $curl_result = json_decode($curl_result);
      curl_close ($ch);

      // clear html to make var_dumps easier to see when debugging
      // $curl_result->module->html = '';

      // isolate the pin data, different end points have different data structures
      if(!$next_page) $pin_array = $curl_result->module->tree->children[1]->children[0]->children[0]->children;
      else $pin_array = $curl_result->module->tree->children;

      // map the pin data into desired format
      $pin_data_array = array();
      $bookmarks = null;
      if(is_array($pin_array)) {
        if(count($pin_array)) {

          foreach ($pin_array as $pin) {

            //setup data
            $image_id = $pin->options->pin_id;
            $image_data = ( isset($pin->data->images->originals) ) ? $pin->data->images->originals : $pin->data->images->orig;
            $pin_url = 'http://pinterest.com/pin/' . $image_id . '/';
            $original_url = $pin->data->link;
            $video = $pin->data->is_video;

            array_push($pin_data_array, array(
              "id"          => $image_id,
              "image"       => $image_data,
              "pin_link"    => $pin_url,
              "orig_link"   => $original_url,
              "video_flag"  => $video,
              ));
          }
          $bookmarks = reset($curl_result->module->tree->resource->options->bookmarks);

        } else {
          $pin_data_array = false;
        }
      }

      // recurse until we're done
      if( !($pin_data_array === false) && !is_null($bookmarks) ) {

        // more pins to get
        $more_pins = $this->get_tagged_pins($search_str, $limit, $bookmarks, ++$page);
        if( !($more_pins === false) ) $pin_data_array = array_merge($pin_data_array, $more_pins);
        return $pin_data_array;
      }

      // end of recursion
      return false;
    }

  } // end class Skrivener_Pins
} // end if



/**
 * Debug/Demo Code
 * delete or comment this section for production
 */

// output headers to control how the content displays
// header("Content-Type: application/json");
header("Content-Type: text/plain");
// header("Content-Type: text/html");

// define search term
// $tag = "vader";
$tag = "haemolytic";
// $tag = "qjkjgjerbjjkrekhjk";

if(class_exists('Skrivener_Pins')) {

  // instantiate the class
  $pin_handler = new Skrivener_Pins();

  // get pins, pinterest returns 25 per batch, function pages through this recursively, pass in limit to 
  // override default limit on number of pages to retrieve, avoid high limits (eg. limit of 20 * 25 pins/page = 500 pins to pull 
  // and 20 separate calls to Pinterest)
  $pins1 = $pin_handler->get_tagged_pins($tag, 2);

  // display the pins for demo purposes
  echo '<h1>Images on Pinterest mentioning "' . $tag . '"</h1>' . "\n";
  if( $pins1 != false ) {
    echo '<p><em>' . count($pins1) . ' images found.</em></p>' . "\n";
    skrivener_dump_images($pins1, 5);
  } else {
    echo '<p><em>No images found.</em></p>' . "\n";
  }
}

// demo function, dumps images in array to html img tags, can pass limit to only display part of array
function skrivener_dump_images($pin_array, $limit = false) {
  if(is_array($pin_array)) {
    if($limit) $pin_array = array_slice($pin_array, -($limit));
    foreach ($pin_array as $pin) {
      echo '<img src="' . $pin['image']->url . '" width="' . $pin['image']->width . '" height="' . $pin['image']->height . '" >' . "\n";
    }
  }
}

?>

如果您在使它适应您的特定终点时遇到问题，请告诉我。 Apols 对于代码中的任何草率，它最初并没有投入生产。

关于c# - 从 Pinterest 网址获取图板中的所有图像，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/18312479/

文章推荐： java - android 4.2.2 是否支持 jre 1.7？

文章推荐： html - overflow hidden 时的 Firefox 文本缩进错误

c# - 异步任务获取 VS HttpResponseMessage 获取
我需要您在以下方面提供帮助。近一个月来，我一直在阅读有关任务和异步的内容。我想尝试在一个简单的 wep api 项目中实现我新获得的知识。我有以下方法，并且它们都按预期工作: public Htt
java - 无法从 URL 获取 URI，获取 null？
我的可执行 jar 中有一个模板文件 (.xls)。不需要在运行时我需要为这个文件创建 100 多个副本(稍后将唯一地附加)。用于获取 jar 文件中的资源 (template.xls)。我正在使用
javascript - Backbone 的模型原型(prototype)获取 vs backbone 获取
我在查看网站的模型代码时对原型(prototype)有疑问。我知道这对 Javascript 中的继承很有用。在这个例子中... define([], function () { "use
javascript - 获取 scrollTop、获取 offsetHeight 和 getStyle 需要很长时间
影响我性能的前三项操作是: 获取滚动条获取偏移高度 Ext.getStyle 为了解释我的应用程序中发生了什么:我有一个网格，其中有一列在每个单元格中呈现网格。当我几乎对网格的内容做任何事情时，它运
javascript - 获取 URL 参数函数，获取 url 部分的值，或者如果存在但没有值则返回 true？
我正在使用以下函数来获取 URL 参数。 function gup(name, url) { name = name.replace(/[\[]/, '\\\[').replace(/[\]]/,
c - MacOS 使用 sysctl() 获取 HW_MACHINE_ARCH 获取 "no such file or directory"
我最近一直在使用 sysctl 来做很多事情，现在我使用 HW_MACHINE_ARCH 变量。我正在使用以下代码。请注意，当我尝试获取其他变量 HW_MACHINE 时，此代码可以完美运行。我还认为
ios - 将我的 YouTube channel 获取(获取)到我的 iOS 应用程序中
关闭。这个问题不符合Stack Overflow guidelines .它目前不接受答案。关闭 9 年前。要求提供代码的问题必须表现出对所解决问题的最低限度的理解。包括尝试过的解决方案、为什么
javascript - webpack:如何从 "bower_components"获取 JavaScript，而不是从 "node_modules"获取 JavaScript
由于使用 main-bower-files 作为使用 Gulp 的编译任务的一部分，我无法使用 node_modules 中的 webpack 来require 模块code> dir 因为我会弄乱当
Javascript - 从 "Monday"获取 "mon"或从 "Tuesday"获取 "tue"等
关闭。这个问题需要更多focused .它目前不接受答案。想改进这个问题吗？更新问题，使其只关注一个问题 editing this post . 关闭 5 年前。 Improve this qu
Java:无法将 Gridlayout 应用于 Jscrollpane。获取获取 java.lang.ClassCastException
我使用 Gridlayout 在一行中放置 4 个元素。首先，我有一个 JPanel，一切正常。对于行数变大并且我必须能够向下滚动的情况，我对其进行了一些更改。现在我的 JPanel 上添加了一个 J
python - 如何从 key BlockDeviceMappings 获取 VolumeId(boto3 获取 ec2 的卷信息)
由于以下原因，我想将 VolumeId 的值保存在变量中: #!/usr/bin/env python import boto3 import json import argparse import
angularjs - 未使用 acquireTokensilent 获取 token ，但使用 acquireTokenpopup Msal-browser 获取 token
我正在将 MSAL 版本 1.x 更新为 MSAL-browser 的 Angular 。所以我正在尝试从版本 1.x 迁移到 2.X.I 能够成功替换代码并且工作正常。但是我遇到了 acquireT
python - 使用 GroupBy 获取 Pandas 的平均值 - 获取 DataError : No numeric types to aggregate -
我知道有很多关于此的问题，例如 Getting daily averages with pandas和 How get monthly mean in pandas using groupby但我遇到
javascript - 无法在 mvc 获取 Controller 方法(来自 Uri())中从 QueryString 获取 DATETIME
This is the query string that I am receiving in URL. Output url: /demo/analysis/test?startDate=Sat+
ubuntu - 从 OpenLayer 3 获取 Geoserver 获取 '500 (Internal Server Error)'
我正在尝试使用 javascript 中的以下代码访问 Geoserver 层 var gkvrtWmsSource =new ol.source.ImageWMS({ u
javascript - 使用 XMLHttpRequest 获取 Ecobee API 信息。获取 500(错误 1 : "Authentication failed. Token is required.")
API 需要一个包含授权代码的 header 。这就是我到目前为止所拥有的: var fullUrl = 'https://api.ecobee.com/1/thermostat?json=\{"s
c# - 获取/删除文件的最后一个字符而不加载到内存中
如何获取文件中的最后一个字符，如果是某个字符，则删除它而不将整个文件加载到内存中？这就是我目前所拥有的。 using (var fileStream = new FileStream("file.t
JSP 获取/设置整个对象的参数
我是这个社区的新手，想出了我的第一个问题。我正在使用 JSP，我成功地创建了 JSP-Sites，它正在使用jsp:setParameter 和 jsp:getParameter 具有单个字符串。
multithreading - 获取/释放语义
在回答 StoreStore reordering happens when compiling C++ for x86 @Peter Cordes 写过 For Acquire/Release se
javascript - 获取 .on 中使用的函数的结果
我有一个函数，我们将其命名为 X1，它返回变量 Y。该函数在操作 .on("focusout", X1) 中使用。如何获取变量Y？执行.on后X1的结果？最佳答案您可以更改 Y 的范围以使其位于函

太空狗

个人简介

我是一名优秀的程序员,十分优秀！

作者热门文章

滴滴打车优惠券免费领取

全站热门文章

首页

博学

6Ren·AI

商城

c# - 从 Pinterest 网址获取图板中的所有图像