javascript - Extract public posts from a Facebook page without an API/APP key / token / secret

Reposted by: 行者123. Last updated: 2023-12-05 00:32:40

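To clarify up front: I do not have a Facebook account and have no intention of creating one. Also, what I am trying to achieve is perfectly legal in my country and in the USA.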

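Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source.
(I want to get the text and the creation time of each post.)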

When I run this in the web console:

document.getElementsByClassName('userContent')

I get a list of elements containing the text of the latest posts.

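But I want to extract that information from a Node.js script. I could do it quite easily with a headless browser such as puppeteer or something similar, but that would add a lot of unnecessary overhead. I would really like a simple approach: download the HTML, pass it to cheerio, and use cheerio's jQuery-like API to extract the posts.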

Here is my attempt:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then( postsHtml => {
    const $ = cheerio.load(postsHtml);

    const timeLinePostEls = $('.userContent');
    console.log(timeLinePostEls.html()); // should NOT be null
    const newestPostEl = timeLinePostEls.eq(0); // .eq(0) keeps the cheerio wrapper; .get(0) returns a raw node without .html()/.text()
    console.log(newestPostEl.html()); // should NOT be null
    const newestPostText = newestPostEl.text();
    console.log(newestPostText);
    //const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
    //console.log(newestPostTime);
}).catch(console.error);

Unfortunately, $('.userContent') does not work. However, I was able to verify that the data I am looking for is embedded somewhere in that HTML code.

But I could not come up with a good regular-expression approach, or anything similar, to extract that data.

The amount of HTML markup in a post varies widely depending on its content.

Here is a simple example of a post containing one link:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;"><p>We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. <a href="https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&amp;h=AT29h2HyDsEk0rHRWqJA-Fa4M1qi3nJT1NBi95othaR3qeFuFAMNiVS2Dgtv5KR5m0xqjw6kfwZdhZt0_D3UQT1Oel2UhxRql-KwkA1xqWvrql4u1jDhzrkGVT_XxoUd8_w8_fzLZzzhz23a8yPCK6IPfWKB76_CEFjG3b78y4dFJvY9Z08AYlR01dmi5_FvWVEVytkN-123u6alYE8pqL6Jb6dtIQUTWGXYJPaNMrtxkCUZniEVXEcILkwHGSuHqCTAarboyMP55F1vhYO3OAiVMkvjbN274fVq92YvbK3bi90bU9T-5ADWHDUJ-CwcofSBTW47chstQeY0n_UluD_rBIPLsfXVSnCtpRkR2kXi9zzHLnNeIYeNssv3i7UKS_f5Z2pnVT6xe3zJbNpB68doH1Z__I9nsTCNIyFyKx2VxabecoL03DIawbRrzBoxLAwzNPLACBjTkpEQhdVn4_wdAIjXRL4cLQDcZkLEoG_sspBgRePH23TFbNufQOBly-FNtLHnkUDO2Ca-FYvAGXpcu6J4B1aH3XFPB803lsz-GRdACyOFOgXDXJfwr4WtWzUHxfiOPULWiI43yI5L4aU6wYRhPjxua3RuRZ8oj9fXa1w4Jrht94Ue2wfKtz8" target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">http://*******/2H3Kbr2</a></p></div>

Formatted more readably, it looks roughly like this:

<div class="_5pbx userContent _3576" data-ft="&#123;&quot;tn&quot;:&quot;K&quot;&#125;">
    <p>
        We&#039;re proud to be named one of Built In NYC&#039;s Best Places to Work in 
        2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for 
        Best Perks and Benefits. See what it took to make the list and check out our 
        profile to see some of our job openings.
        <a href="VERY_LONG_URL.........." target="_blank" data-ft="&#123;&quot;tn&quot;:&quot;-U&quot;&#125;" rel="noopener nofollow" data-lynx-mode="async">SHORT_LINK.....</a>
    </p>
</div>
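
The link in the sample post points at Facebook's l.php redirector rather than at the target itself. As a minimal sketch, assuming the real destination is carried in the `u` query parameter (as it is in the sample above), it can be recovered with Node's built-in `URL` class. The helper name `unwrapFacebookLink` and the shortened `h` value are made up for this illustration.

```javascript
// Hypothetical helper: recover the real target from an l.facebook.com redirect link.
// Assumes the destination is stored in the `u` query parameter, as in the sample post.
function unwrapFacebookLink(rawHref) {
    // Attribute values copied from raw HTML still contain &amp; entities.
    const href = rawHref.replace(/&amp;/g, '&');
    const url = new URL(href);
    const isRedirect = url.hostname === 'l.facebook.com' && url.pathname === '/l.php';
    // searchParams.get() already percent-decodes the value.
    const target = isRedirect ? url.searchParams.get('u') : null;
    return target || href;
}

// Shortened version of the href from the sample post (the `h` value is truncated here).
const sample = 'https://l.facebook.com/l.php?u=https%3A%2F%2Fbit.ly%2F2H3Kbr2&amp;h=AT29h2';
console.log(unwrapFacebookLink(sample)); // https://bit.ly/2H3Kbr2
```

Links that do not go through the redirector are returned unchanged (apart from the entity decoding).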

This regular expression seems to work, but I don't think it is very reliable:

/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g

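For example, it will not work properly if the post contains another div element. Besides that, this approach gives me no way of finding out the time/date when the post was created.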
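
The nested-div problem can be reproduced without any network access. The sketch below runs the regular expression from above against a made-up post (the markup in `nestedPost` is hypothetical, not taken from Facebook) that contains an inner div:

```javascript
// The regular expression from the question, unchanged.
const postRegex = /<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g;

// Hypothetical post markup that contains a nested <div>.
const nestedPost =
    '<div class="_5pbx userContent _3576" data-ft="x">' +
    '<p>Intro</p><div class="quote">Quoted text</div><p>Outro</p>' +
    '</div>';

const match = postRegex.exec(nestedPost);
const captured = match[1];
console.log(captured);
// The lazy (.+?) stops at the first </div>, so the capture is cut off
// inside the nested element and the trailing "Outro" paragraph is lost:
// <p>Intro</p><div class="quote">Quoted text
```

A real HTML parser tracks nesting and does not have this limitation, which is why the cheerio approach is preferable here.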

Any ideas how to extract the 2-3 most recent posts, including their creation date/time, in a relatively reliable way?

Best answer

OK, I finally figured it out. I hope this is useful to others. This function extracts the 20 latest posts, including their creation time:

// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

function GetFbPosts(pageUrl) {
    const requestOptions = {
        url: pageUrl,
        headers: {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
        }
    };
    return rp.get(requestOptions).then( postsHtml => {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
        const posts = timeLinePostEls.map(post=>{
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            }
        });
        return posts;
    });
}
GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts=>{
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});

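Because Facebook messages can have complex formatting, the message is returned as HTML rather than plain text. You can strip the formatting and get plain text by replacing message: post.html() with message: post.text().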

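Edit:
Getting more than the 20 latest posts is more complicated. The first 20 posts are served statically in the initial HTML page. All subsequent posts are retrieved via AJAX in groups of 8.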
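
To illustrate the AJAX step: the scraper code in this answer drops the first 9 characters of each AJAX response before parsing it, which matches the length of the `for (;;);` guard that Facebook has prepended to its JSON responses. Below is a minimal sketch of that unwrapping, using a made-up and heavily shortened response body. Only the guard and the `domops[0][3].__html` path are taken from the answer's code; the helper name `extractPostsHtml` and all other values are assumptions for illustration.

```javascript
// Hypothetical, heavily shortened response body. Only the `for (;;);` guard and the
// domops[0][3].__html path mirror what the scraper relies on; everything else is invented.
const responseBody =
    'for (;;);' +
    JSON.stringify({
        domops: [['replace', '#placeholder', false, { __html: '<div class="userContent"><p>Hello</p></div>' }]]
    });

const GUARD = 'for (;;);';

function extractPostsHtml(body) {
    // Strip the guard only if it is really there, instead of blindly cutting 9 characters.
    const json = body.startsWith(GUARD) ? body.slice(GUARD.length) : body;
    const parsed = JSON.parse(json);
    return parsed.domops[0][3].__html;
}

console.log(extractPostsHtml(responseBody)); // <div class="userContent"><p>Hello</p></div>
```

Checking for the guard explicitly makes the failure mode clearer if the response format ever changes: `JSON.parse` then throws on the unexpected body instead of silently parsing a truncated string.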
The following class implements this:

// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');

class FbScrape {
    constructor(options={}) {
        this.headers = options.headers || {
            'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
        };
    }

    async getPosts(pageUrl, limit=20) {
        const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
        if (limit <= 20) {
            return this._parsePostsHtml(staticPostsHtml);
        } else {
            let staticPosts = this._parsePostsHtml(staticPostsHtml);
            const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
            const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit-20);
            return staticPosts.concat(ajaxPosts);
        }
    }

    _parsePostsHtml(postsHtml) {
        const $ = cheerio.load(postsHtml);
        const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
        const posts = timeLinePostEls.map(post => {
            return {
                message: post.html(),
                created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
            }
        });
        return posts;
    }

    async _getAjaxPosts(resultsUrl, limit=8, posts=[]) {
        const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
        const extractedJson = JSON.parse(responseBody.slice(9)); // skip the 9-character "for (;;);" guard that precedes the JSON
        const postsHtml = extractedJson.domops[0][3].__html;
        const newPosts = this._parsePostsHtml(postsHtml);
        const allPosts = posts.concat(newPosts);
        const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
        if (allPosts.length+1 >= limit)
            return allPosts;
        else
            return await this._getAjaxPosts(nextResultsUrl, limit, allPosts);
    }

    _getNextPageAjaxUrl(html) {
        return 'https://www.facebook.com' + /"(\/pages_reaction_units\/more[^"]+)"/g.exec(html)[1].replace(/&amp;/g, '&') + '&__a=1';
    }
}

const fbScrape = new FbScrape();
const minimum = 28; // minimum number of posts to request; rounded up to 20, 28, 36, 44, 52, 60, 68, ... because of the page sizes (first page = 20, every following page = 8)
fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
    // Log all posts
    for (const post of posts) {
        console.log(post.created_at, post.message);
    }
});
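
The rounding described in the comment on `minimum` follows directly from the page sizes. As a rough sketch, assuming every page comes back completely full, the number of posts returned for a requested minimum can be estimated like this (`expectedPostCount` is a hypothetical helper, not part of the scraper):

```javascript
// Page sizes as described in the answer: 20 posts in the static HTML, 8 per AJAX page.
const STATIC_PAGE_SIZE = 20;
const AJAX_PAGE_SIZE = 8;

// Hypothetical helper: how many posts come back for a requested minimum,
// assuming every page is completely full.
function expectedPostCount(minimum) {
    if (minimum <= STATIC_PAGE_SIZE) return STATIC_PAGE_SIZE;
    const ajaxPages = Math.ceil((minimum - STATIC_PAGE_SIZE) / AJAX_PAGE_SIZE);
    return STATIC_PAGE_SIZE + ajaxPages * AJAX_PAGE_SIZE;
}

for (const n of [5, 20, 21, 28, 30, 60]) {
    console.log(n, '->', expectedPostCount(n));
}
```

Note that the stop condition in `_getAjaxPosts` (`allPosts.length+1 >= limit`) has one post of slack, so near a page boundary the class can return one AJAX page fewer than this estimate. For example, requesting 29 posts stops after 28.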

Regarding "javascript - Extract public posts from a Facebook page without an API/APP key / token / secret", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54256433/
