
python - How to get Wikipedia data for WikiProjects?

Reposted. Author: 太空宇宙. Updated: 2023-11-04 08:28:36

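I recently discovered that Wikipedia has WikiProjects, which are categorized by discipline ( https://en.wikipedia.org/wiki/Category:WikiProjects_by_discipline ). As the link shows, there are 34 disciplines.

I would like to know whether it is possible to get all the Wikipedia articles related to each of these disciplines.

For example, consider WikiProject Computer science. Is it possible to get all the computer science related Wikipedia articles using the WikiProject Computer science category? If so, are there data dumps for it, or is there some other way to obtain this data?

I am currently using python (i.e. pywikibot, pymediawiki). However, I am also happy to receive answers in other languages.

I am happy to provide more details if needed.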

Best answer

As I suggested in, and added to, @arash's answer, you can use the Wikipedia API to get Wikipedia data. The following page describes how to do that: API:Categorymembers#GET_request
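A categorymembers request is an ordinary GET URL made of query parameters, so it can be assembled programmatically rather than written by hand. The following is a minimal sketch; the helper name `buildCategoryMembersUrl` and the constant `API_ENDPOINT` are hypothetical and not part of the original answer, while the parameter names come from the URL used in this answer.

```javascript
// Hypothetical helper (not part of the original answer): builds a
// list=categorymembers request URL from a category title.
const API_ENDPOINT = 'https://en.wikipedia.org/w/api.php';

function buildCategoryMembersUrl(categoryTitle, limit = 500) {
    const params = new URLSearchParams({
        action: 'query',
        format: 'json',
        list: 'categorymembers',
        cmtitle: categoryTitle,   // must include the "Category:" prefix
        cmprop: 'title',          // only the titles are needed here
        cmlimit: String(limit),   // 500 is the maximum for ordinary (non-bot) clients
    });
    // URLSearchParams percent-encodes the values (":" becomes "%3A")
    return API_ENDPOINT + '?' + params.toString();
}

console.log(buildCategoryMembersUrl('Category:WikiProject_Computer_science_articles'));
```

Letting `URLSearchParams` do the encoding avoids hand-escaping characters such as `:` in the category title.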

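Since you said you need to fetch the data programmatically, below is sample code in JavaScript. It fetches the first 500 names from Category:WikiProject_Computer_science_articles and prints them as output. You can convert this example to the language of your choice: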

// Importing the module (require() works with node-fetch v2;
// Node.js 18+ also provides a built-in global fetch)
const fetch = require('node-fetch');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    const len = t.query.categorymembers.length;
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the names
        console.log(t.query.categorymembers[i].title);
    }
});
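The code above assumes the response always contains `query.categorymembers`. When a request is rejected, the API instead returns an object with an `error` key, and reading `t.query` would then throw. A minimal sketch of a more defensive extraction step follows; `extractTitles` is a hypothetical helper, and the sample response is hand-written for illustration.

```javascript
// Hypothetical helper: pulls the titles out of a parsed categorymembers
// response, and throws if the API reported an error instead of results.
function extractTitles(response) {
    if (response.error) {
        throw new Error('API error ' + response.error.code + ': ' + response.error.info);
    }
    // Fall back to an empty list when the response has no members
    const members = (response.query && response.query.categorymembers) || [];
    return members.map(member => member.title);
}

// Hand-written sample in the shape the API returns (values are made up)
const sample = {
    batchcomplete: '',
    query: {
        categorymembers: [
            { pageid: 1, ns: 0, title: 'Example A' },
            { pageid: 2, ns: 0, title: 'Example B' },
        ],
    },
};

console.log(extractTitles(sample));
```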

To write the data to a file, you can do the following:

// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    const len = t.query.categorymembers.length;
    // Initializing an empty array
    const titles = [];
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the name and collecting it in the array
        const title = t.query.categorymembers[i].title;
        console.log(title);
        titles[i] = title;
    }
    // writeFileSync expects a string or buffer, so the array is joined
    // explicitly into one comma-separated string
    fs.writeFileSync('pathtotitles\\titles.txt', titles.join(','));
});

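The code above stores the titles in the file separated by commas, because a JavaScript array becomes a comma-separated string when it is converted to text. To store one title per line without commas, do this instead: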

// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmprop=title&cmlimit=500";

// Fetching using 'node-fetch'
fetch(url).then(res => res.json()).then(t => {
    // Getting the length of the returned array
    const len = t.query.categorymembers.length;
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for (let i = 0; i < len; i++) {
        // Printing the name and appending it as its own line
        const title = t.query.categorymembers[i].title;
        console.log(title);
        titles += title + "\n";
    }
    fs.writeFileSync('pathtotitles\\titles.txt', titles);
});
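Building the file contents with `Array.prototype.join` is an alternative to concatenating inside the loop. The sketch below uses hypothetical helper names. The `Talk:` handling rests on an assumption: that a WikiProject article category lists talk pages (where project banners are usually placed) rather than the articles themselves. It is worth checking the actual output before relying on it.

```javascript
// Hypothetical helpers, not from the original answer.

// Removes a leading "Talk:" namespace prefix, if present.
// ASSUMPTION: the category members are talk pages such as "Talk:Foo".
function stripTalkPrefix(title) {
    return title.startsWith('Talk:') ? title.slice('Talk:'.length) : title;
}

// Joins titles into newline-terminated lines, ready for fs.writeFileSync.
function toLines(titles) {
    return titles.length === 0 ? '' : titles.join('\n') + '\n';
}

const exampleTitles = ['Talk:Example A', 'Example B'];  // made-up titles
console.log(toLines(exampleTitles.map(stripTalkPrefix)));
```

The resulting string can be passed to `fs.writeFileSync` in place of the `titles` string built in the loop.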

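With cmlimit we cannot fetch more than 500 titles in a single request, so we need to use cmcontinue to check for and fetch the next page...

Try the code below to fetch all the titles of a specific category, print them, and append the data to a file: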

// Importing the modules
const fetch = require('node-fetch');
const fs = require('fs');

// URL with resources to fetch
const url = "https://en.wikipedia.org/w/api.php?action=query&format=json&list=categorymembers&cmtitle=Category%3AWikiProject_Computer_science_articles&cmlimit=500";

// Fetches one page, appends its titles to the file, and returns the
// cmcontinue token for the next page (undefined when no pages remain)
const fetchTheData = async (pageUrl) => {
    const data = await fetch(pageUrl).then(res => res.json());
    // Initializing an empty string
    let titles = '';
    // Iterating over all the response data
    for (const member of data.query.categorymembers) {
        // Printing the name and appending it as its own line
        console.log(member.title);
        titles += member.title + "\n";
    }
    // Appending to the file
    fs.appendFileSync('pathtotitles\\titles.txt', titles);
    // The "continue" object is absent on the last page
    return data.continue ? data.continue.cmcontinue : undefined;
};

// Builds each next-page URL from the continuation token and keeps
// fetching until the API stops returning a token
const constructNextPageURL = async (baseUrl) => {
    // Getting the first next-page token
    let nextPage = await fetchTheData(baseUrl);
    while (nextPage !== undefined) {
        // The token can contain characters such as "|", so encode it
        const nextUrl = baseUrl + '&cmcontinue=' + encodeURIComponent(nextPage);
        console.log("=> The next page URL is : " + nextUrl);
        nextPage = await fetchTheData(nextUrl);
    }
    console.log("===>>> Finished Fetching...");
};

// Calling to begin extraction
constructNextPageURL(url);
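A variation on the pagination above is to send back every value in the response's `continue` object, not only `cmcontinue`, and to stop when that object is absent. The sketch below is one way to write that loop. `fetchAllTitles`, `getJson`, and `fakeGetJson` are hypothetical names, and the fetcher is injected so the loop can be tried without network access; the fake two-page data is made up.

```javascript
// Hypothetical sketch: collects every title in a category by following
// the API's "continue" object. getJson is injected; in real use it could be
//   url => fetch(url).then(res => res.json())
async function fetchAllTitles(baseParams, getJson) {
    const endpoint = 'https://en.wikipedia.org/w/api.php';
    const titles = [];
    let continuation = {};
    do {
        // Merge the previous "continue" values into the next request
        const params = new URLSearchParams({ ...baseParams, ...continuation });
        const data = await getJson(endpoint + '?' + params.toString());
        for (const member of data.query.categorymembers) {
            titles.push(member.title);
        }
        continuation = data.continue;   // undefined on the last page
    } while (continuation);
    return titles;
}

// Fake two-page API used only to demonstrate the loop (made-up data)
const fakeGetJson = async (url) => {
    const token = new URL(url).searchParams.get('cmcontinue');
    return token === null
        ? { continue: { cmcontinue: 'page|2', continue: '-||' },
            query: { categorymembers: [{ title: 'Example A' }] } }
        : { query: { categorymembers: [{ title: 'Example B' }] } };
};

fetchAllTitles(
    { action: 'query', format: 'json', list: 'categorymembers',
      cmtitle: 'Category:WikiProject_Computer_science_articles', cmlimit: '500' },
    fakeGetJson
).then(titles => console.log(titles));
```

Collecting the titles in memory and writing the file once at the end also avoids leaving a partially written file behind if a request fails midway.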

Hope this helps.

Regarding "python - How to get Wikipedia data for WikiProjects?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/54729496/
