
dom - How to dump more than <body> on Chrome / Chrome headless?

Reposted. Author: 行者123. Updated: 2023-12-04 12:17:07

Chrome's documentation says:

The --dump-dom flag prints document.body.innerHTML to stdout:



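As the title says, how can I dump more of the DOM (ideally all of it) with headless Chromium? I can save the whole DOM manually through the developer tools, but I want a programmatic solution.

Best answer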

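Update 2019-04-23: Google has been very active on the headless front and many updates have landed.

The answer below applies to v62. The current version is v73, and it keeps being updated.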
https://www.chromestatus.com/features/schedule

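I strongly recommend checking out puppeteer for any future work with headless Chrome. It is maintained by Google, and the npm package installs the required Chrome version along with it. You simply use the puppeteer API from the docs and don't have to worry about Chrome versions or about setting up the connection between headless Chrome and the DevTools API, which does 99% of the magic.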

  • Repo: https://github.com/GoogleChrome/puppeteer
  • Docs: https://pptr.dev/
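
As a minimal sketch of the puppeteer route, assuming the `puppeteer` package is installed: the helper below (`dumpFullHtml` is a made-up name, not part of puppeteer) takes the puppeteer module as an argument, loads the page, and returns `page.content()`, which serializes the whole document rather than only the body. The `networkidle0` wait condition is my assumption about when a JS-heavy page counts as loaded.

```javascript
// Hypothetical helper: resolves with the full serialized HTML of `url`.
// The puppeteer module is passed in so the caller decides which install to use.
async function dumpFullHtml(puppeteer, url) {
  const browser = await puppeteer.launch(); // launches headless Chrome
  try {
    const page = await browser.newPage();
    // Assumption: waiting for the network to go idle is good enough for SPAs.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // page.content() returns the whole document, not just <body>.
    return await page.content();
  } finally {
    await browser.close(); // always release the Chrome process
  }
}
```

Calling `dumpFullHtml(require('puppeteer'), 'https://www.chromestatus.com/')` and printing the resolved string should give roughly what saving the DOM from the developer tools gives.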


  • Update 2017-10-29: Chrome already has the --dump-html flag, which returns the full HTML, not only the body.

    v62 does have it, and it is already on the stable channel.

    The issue that fixed this: https://bugs.chromium.org/p/chromium/issues/detail?id=752747

    Current Chrome status (version per channel): https://www.chromestatus.com/features/schedule
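
For scripting the command-line route from Node, here is a rough sketch. The helper names and the `chromePath` argument are made up for illustration. The switch is left as a parameter that defaults to `--dump-dom`, the one named in the documentation quoted in the question. Pass a different switch only after confirming that your Chrome build accepts it.

```javascript
const { execFile } = require('child_process');

// Build the argument list for a one-shot headless dump.
// `flag` defaults to --dump-dom, the switch named in Chrome's documentation.
function buildDumpArgs(url, flag = '--dump-dom') {
  return ['--headless', flag, url];
}

// Run Chrome once and resolve with whatever it prints to stdout.
// `chromePath` is an assumption, e.g. 'google-chrome' or 'chromium' on Linux.
function dumpWithFlag(chromePath, url, flag) {
  return new Promise((resolve, reject) => {
    execFile(
      chromePath,
      buildDumpArgs(url, flag),
      { maxBuffer: 64 * 1024 * 1024 }, // large pages can exceed the default buffer
      (err, stdout) => (err ? reject(err) : resolve(stdout))
    );
  });
}
```

For example, `dumpWithFlag('google-chrome', 'https://www.chromestatus.com/')` would resolve with Chrome's stdout. The binary name varies by platform.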

    Leaving the old answer below for legacy:

    You can do it with the Chrome remote interface. I tried it and wasted a couple of hours trying to launch Chrome and get the full HTML, including the title, and I would say it is just not ready yet.

    It works sometimes, but when I ran it in a production environment I got errors from time to time: all kinds of random errors, such as connection resets and "no chrome found to kill". These errors came up intermittently and were hard to debug.

    Personally, I use --dump-dom to get the HTML when I need the body, and when I need the title I just use curl for now. Of course, Chrome can give you the title of an SPA, which curl alone cannot do if the title is set from JS. I will switch to Chrome once there is a stable solution.

    I would love to have a --dump-html flag in Chrome that simply returns all the HTML. If a Google engineer is reading this, please add such a flag to Chrome.

    I've created an issue on the Chrome issue tracker. Please click the "star" to favorite it so that Google developers notice it:

    https://bugs.chromium.org/p/chromium/issues/detail?id=752747

    Here is a long list of all kinds of Chrome flags (I'm not sure it is complete, and I found nothing in it for dumping the title tag): https://peter.sh/experiments/chromium-command-line-switches/

    This code is from Google's blog post. You can try your luck with it:

    const CDP = require('chrome-remote-interface');

    ...

    (async function() {

      // launchChrome() is not defined in this excerpt; see the blog post linked below.
      const chrome = await launchChrome();
      const protocol = await CDP({port: chrome.port});

      // Extract the DevTools protocol domains we need and enable them.
      // See API docs: https://chromedevtools.github.io/devtools-protocol/
      const {Page, Runtime} = protocol;
      await Promise.all([Page.enable(), Runtime.enable()]);

      Page.navigate({url: 'https://www.chromestatus.com/'});

      // Wait for window.onload before doing stuff.
      Page.loadEventFired(async () => {
        const js = "document.querySelector('title').textContent";
        // Evaluate the JS expression in the page.
        const result = await Runtime.evaluate({expression: js});

        console.log('Title of page: ' + result.result.value);

        protocol.close();
        chrome.kill(); // Kill Chrome.
      });

    })();

    Source: https://developers.google.com/web/updates/2017/04/headless-chrome
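
The same skeleton can return the whole document instead of just the title by evaluating a different expression. A minimal sketch, assuming `protocol` is the connected chrome-remote-interface client from the snippet above and the page has already finished loading (`getFullHtml` is a made-up name):

```javascript
// Hypothetical helper: given a connected chrome-remote-interface client,
// resolve with the serialized <html> element of the current page.
// Note: outerHTML does not include the <!DOCTYPE ...> line.
async function getFullHtml(protocol) {
  const { Runtime } = protocol;
  const { result, exceptionDetails } = await Runtime.evaluate({
    expression: 'document.documentElement.outerHTML',
    returnByValue: true,
  });
  if (exceptionDetails) {
    // The expression threw inside the page; surface that to the caller.
    throw new Error('Evaluation failed: ' + exceptionDetails.text);
  }
  return result.value;
}
```

Inside the `Page.loadEventFired` callback, `console.log(await getFullHtml(protocol))` would then print the head and title along with the body.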

    Regarding "dom - How to dump more than <body> on Chrome / Chrome headless?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44851729/
