node.js - Unable to scrape and print links on the fly

Reposted. Author: 太空宇宙. Updated: 2023-11-04 03:18:59

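I've written a script in node.js to scrape the links of the different titles from a webpage. When I execute the script below, undefined is printed in the console instead of the links I'm after. The selector I defined is accurate.

I do not want to put the links in an array and return the result; instead, I want to print them on the fly. Since I'm very new to writing scripts with node.js and puppeteer, I can't figure out the mistake I'm making.

Here is my script (link to that site):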

const puppeteer = require('puppeteer');
function run () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
      let url = await page.evaluate(() => {
        let items = document.querySelectorAll('a.question-hyperlink');
        items.forEach((item) => {
          //would like to keep the following line intact
          console.log(item.getAttribute('href'));
        });
      })
      browser.close();
      return resolve(url);
    } catch (e) {
      return reject(e);
    }
  })
}
run().then(console.log).catch(console.error);

The following script works fine if I declare an empty array results, store the scraped links in it, and finally return results, but I do not want to do it that way. I would like to stick to the approach I tried above, printing each result on the fly.

const puppeteer = require('puppeteer');
function run () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
      let urls = await page.evaluate(() => {
        let results = [];
        let items = document.querySelectorAll('a.question-hyperlink');
        items.forEach((item) => {
          results.push({
            url: item.getAttribute('href'),
          });
        });
        return results;
      })
      browser.close();
      return resolve(urls);
    } catch (e) {
      return reject(e);
    }
  })
}
run().then(console.log).catch(console.error);

Once again: my question is how I can print the links on the fly, as in console.log(item.getAttribute('href'));, without storing them in an array?

Best answer

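The callback passed to evaluate() runs inside the browser page, so its console.log() output goes to the page's console rather than to the Node terminal. To forward that output, add the line below right after the line where you define page:

page.on('console', msg => console.log(msg.text()));

So the whole snippet now looks like this: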

const puppeteer = require('puppeteer');
function run () {
  return new Promise(async (resolve, reject) => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      // Forward every console message from the page to the Node terminal.
      // text() is the public accessor; the private _text field is not reliable.
      page.on('console', msg => console.log(msg.text()));
      await page.goto("https://stackoverflow.com/questions/tagged/web-scraping");
      let url = await page.evaluate(() => {
        let items = document.querySelectorAll('a.question-hyperlink');
        items.forEach((item) => {
          //would like to keep the following line intact
          console.log(item.getAttribute('href'));
        });
      })
      await browser.close();
      // evaluate() returns nothing here, so url is undefined
      return resolve(url);
    } catch (e) {
      return reject(e);
    }
  })
}
// The links print on the fly; the final console.log prints undefined.
run().then(console.log).catch(console.error);
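
For comparison, here is a minimal sketch of the same idea written with a plain async function instead of wrapping an async callback in new Promise. It is only one possible arrangement, not the original poster's code: the helper names printLinks and run are made up for illustration, the selector is the one from the question and may no longer match the live site, and the listener is assumed to only care about messages of type 'log'.

```javascript
// Sketch only: forwards console.log() calls made inside page.evaluate()
// to the Node terminal. Assumes the `puppeteer` package is installed.
// `printLinks` and `run` are hypothetical helper names.

// Attach the console listener, load the page, and log each href in the page.
async function printLinks(page, url, selector) {
  // Each page-side console call arrives here as a ConsoleMessage.
  page.on('console', (msg) => {
    // Only forward console.log(); page warnings and errors are skipped.
    if (msg.type() === 'log') console.log(msg.text());
  });
  await page.goto(url);
  // The callback runs in the browser; `selector` is passed as an argument
  // because variables from the Node scope are not visible inside the page.
  await page.evaluate((sel) => {
    document.querySelectorAll(sel).forEach((item) => {
      console.log(item.getAttribute('href'));
    });
  }, selector);
}

async function run() {
  // Required lazily so printLinks can be exercised with a stub page object.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await printLinks(
      page,
      'https://stackoverflow.com/questions/tagged/web-scraping',
      'a.question-hyperlink' // selector taken from the question
    );
  } finally {
    await browser.close(); // close the browser even if something throws
  }
}

// Usage: run().catch(console.error);
```

Because run() returns nothing, there is no trailing undefined to print, and the try/finally closes the browser even when navigation or evaluation fails.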

Hope this helps.

Regarding node.js - Unable to scrape and print links on the fly, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52710298/
