javascript - Asynchronous requests in a web scraper

Reposted. Author: 行者123. Updated: 2023-12-01 01:17:05

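I have an array of URLs. I want to scrape an HTML table from each URL and save it in another array, in the same order as the original array.

Because of Node's asynchronous nature, I think it is not working the way I expect: the results come back in a different order on every run.

I have searched Google extensively and tried different approaches, such as a custom async forEach function and request-promise instead of request, but nothing has worked.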

const request = require('request');
const rp = require('request-promise');
const cheerio = require('cheerio');
const fs = require('fs');

let verbs = [];
let conjugations = [];
fs.readFileSync('verbs.txt', 'utf-8').split(/\r?\n/).forEach(function (line) {
  verbs.push(line);
});

verbs.forEach((verb) => {
  const URI = encodeURI("https://ru.wiktionary.org/wiki/" + verb);

  var options = {
    uri: URI,
    transform: function (body) {
      return cheerio.load(body);
    }
  };

  rp(options).then(function ($) {
    let table = $('span#Русский.mw-headline').parent().nextAll('table').first();
    conjugations.push(table.text());
    console.log(conjugations[0]);
  })
  .catch(function (err) {
  });
});

Best answer

Use Promise.all if the order matters.

The Promise.all() method returns a single Promise that resolves when all of the promises passed as an iterable have resolved or when the iterable contains no promises. It rejects with the reason of the first promise that rejects.
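
As a rough illustration of the three behaviours in that quote, here is a minimal sketch. The helper `settleLater` and the labels passed to it are made up for this example, and timers stand in for real requests.

```javascript
// Hypothetical helper: settles after `delayMs`, rejecting when `fail` is true.
function settleLater(label, delayMs, fail) {
  return new Promise((resolve, reject) =>
    setTimeout(() => (fail ? reject(new Error(label)) : resolve(label)), delayMs)
  );
}

// An empty iterable resolves to an empty array.
const emptyCase = Promise.all([]);

// Plain values are passed through and keep their position in the result.
const mixedCase = Promise.all([settleLater('a', 20, false), 'b']);

// The combined promise rejects with the reason of whichever input rejects first in time.
const failedCase = Promise.all([
  settleLater('slow failure', 40, true),
  settleLater('fast failure', 10, true),
  settleLater('ok', 5, false),
]).catch((err) => err.message);

emptyCase.then((value) => console.log('empty:', value));
mixedCase.then((value) => console.log('mixed:', value));
failedCase.then((message) => console.log('rejected with:', message));
```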

An example that keeps things in order:

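const verbs = ["hello", "world", "example"];

// Each later verb gets a shorter timeout, so the promises settle in reverse order.
let timeout = 2000;
const promises = verbs.map(verb => {
  timeout -= 500;
  return new Promise((resolve, reject) => {
    setTimeout(function () {
      resolve(verb);
    }, timeout);
  });
});

// dataArray still follows the order of verbs, not the order of completion.
Promise.all(promises).then(dataArray => console.log(dataArray));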

The same solution applied to your code:

const promises = verbs.map((verb) => {
  const URI = encodeURI("https://ru.wiktionary.org/wiki/" + verb);
  const options = {
    uri: URI,
    // Parse each response body with cheerio before resolving.
    transform: function (body) {
      return cheerio.load(body);
    }
  };

  return rp(options);
});

Promise.all(promises).then((dataArray) => {
  // dataArray is in the same order as verbs, so push() keeps that order.
  dataArray.forEach(function ($) {
    let table = $('span#Русский.mw-headline').parent().nextAll('table').first();
    conjugations.push(table.text());
  });
  console.log(conjugations);
}).catch(function (err) {
  // Report the first failed request instead of swallowing it.
  console.error(err);
});

The downside is that if any one request fails, Promise.all rejects and the results of all the other requests are lost.
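
If a single failure should not discard everything, one possible workaround (not part of the original answer) is Promise.allSettled, available in Node 12.9 and later. The sketch below is only an illustration: `fakeFetch` is a hypothetical stand-in for `rp(options)`, the verb `'broken'` is an invented failing case, and using `null` as the placeholder for a failed request is an arbitrary choice.

```javascript
// Hypothetical stand-in for rp(options): resolves with the verb's "table",
// or rejects for one verb to simulate a failed request.
function fakeFetch(verb, delayMs) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (verb === 'broken') {
        reject(new Error('request failed for ' + verb));
      } else {
        resolve('table for ' + verb);
      }
    }, delayMs);
  });
}

// Collect results in input order; a failed request becomes null
// instead of rejecting the whole batch.
function fetchAllInOrder(verbs) {
  // Later verbs get shorter delays, so they finish first.
  const promises = verbs.map((verb, index) => fakeFetch(verb, 30 - index * 10));
  return Promise.allSettled(promises).then((results) =>
    results.map((r) => (r.status === 'fulfilled' ? r.value : null))
  );
}

fetchAllInOrder(['hello', 'broken', 'world']).then((conjugations) => {
  console.log(conjugations);
});
```

Each entry returned by Promise.allSettled carries a `status` of `'fulfilled'` or `'rejected'`, so the position of a failed verb is preserved and can be retried or skipped later.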

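Alternatively, you can do something similar by using each verb's index. (Promise.all is used here only to detect when everything has finished; that step can be left out.)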

const verbs = ["hello", "world", "example"];

const conjugations = [];
let timeout = 2000;
const promises = verbs.map((verb, index) => {
  return new Promise((resolve, reject) => {
    setTimeout(function () {
      // Write to the verb's own slot, so completion order does not matter.
      conjugations[index] = verb;
      resolve();
    }, timeout);
    timeout -= 500;
  });
});

Promise.all(promises).then(() => console.log(conjugations));
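
To make the difference from the original push-based code concrete, here is a small side-by-side sketch. The helper `delayed` and the function `compareOrders` are hypothetical names for this illustration, with timers standing in for network requests.

```javascript
// Hypothetical stand-in for a network request: resolves with `value` after `delayMs`.
function delayed(value, delayMs) {
  return new Promise((resolve) => setTimeout(() => resolve(value), delayMs));
}

function compareOrders(verbs) {
  const pushed = [];   // filled in completion order
  const indexed = [];  // filled by position in the input array
  const promises = verbs.map((verb, index) =>
    // Earlier verbs get longer delays, so they finish last.
    delayed(verb, (verbs.length - index) * 20).then((value) => {
      pushed.push(value);
      indexed[index] = value;
    })
  );
  return Promise.all(promises).then(() => ({ pushed, indexed }));
}

compareOrders(['hello', 'world', 'example']).then(({ pushed, indexed }) => {
  console.log(pushed);   // completion order: example, world, hello
  console.log(indexed);  // input order: hello, world, example
});
```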

Applied to your code:

const request = require('request');
const rp = require('request-promise');
const cheerio = require('cheerio');
const fs = require('fs');

let verbs = [];
let conjugations = [];
fs.readFileSync('verbs.txt', 'utf-8').split(/\r?\n/).forEach(function (line) {
  verbs.push(line);
});

verbs.forEach((verb, index) => {
  const URI = encodeURI("https://ru.wiktionary.org/wiki/" + verb);

  var options = {
    uri: URI,
    transform: function (body) {
      return cheerio.load(body);
    }
  };

  rp(options).then(function ($) {
    let table = $('span#Русский.mw-headline').parent().nextAll('table').first();
    // Store the result at the verb's own index instead of pushing.
    conjugations[index] = table.text();
    console.log(conjugations[index]);
  })
  .catch(function (err) {
    // Report the failure; conjugations[index] stays undefined for this verb.
    console.error(err);
  });
});

Regarding javascript - asynchronous requests in a web scraper, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54650372/
