php - How do I parse Wikipedia XML with PHP?

Reposted. Author: 可可西里. Updated: 2023-11-01 00:11:15

How do I parse Wikipedia XML with PHP? I tried SimplePie, but I got nothing back. This is the link whose data I want to fetch.

http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml

Edited code:

<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
echo $xml->api->query->pages->page->rev;
?>

Best answer

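I usually use a combination of CURL and XMLReader to parse the XML generated by the MediaWiki API.

Note that you must include your email address in the User-Agent header, otherwise the API script will respond with HTTP 403 Forbidden.

Here is how I initialize the CURL handle: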

define("EMAIL_ADDRESS", "my@email.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
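
As a minimal sketch of how the User-Agent string above is assembled, the snippet below moves the interpolation into a function so it can be checked without a network call. The function name `build_user_agent` and the sample version values are assumptions made for illustration; they are not part of the original answer. In real use you would pass the array returned by `curl_version()`.

```php
<?php
// Hypothetical helper (not from the original answer): builds a descriptive
// User-Agent string from curl_version()-style data plus a contact address.
function build_user_agent(array $cv, string $email): string
{
    // Uses "{$...}" interpolation; the "${...}" form is deprecated as of PHP 8.2.
    return "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} "
        . "{$cv['ssl_version']} zlib/{$cv['libz_version']} <{$email}>";
}

// Illustrative values only; in real use pass curl_version() instead.
$sample = [
    'version'      => '8.5.0',
    'host'         => 'x86_64-pc-linux-gnu',
    'ssl_version'  => 'OpenSSL/3.0.13',
    'libz_version' => '1.3',
];

$user_agent = build_user_agent($sample, 'my@email.com');
echo $user_agent, "\n";
```

The result can be handed to `curl_setopt($ch, CURLOPT_USERAGENT, $user_agent)` exactly as in the code above.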

You can then use this code to fetch the XML and construct a new XMLReader object in $xml_reader:

curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
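
To see what walking the document with XMLReader looks like without touching the network, here is a small sketch that runs against a hand-written sample. The sample XML is an assumption about the shape of the API response (`api > query > pages > page > revisions > rev`), not real API output, and the page titles are made up.

```php
<?php
// Hand-written sample that mimics the assumed shape of the API response.
$xml = <<<XML
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="Example A">
        <revisions>
          <rev xml:space="preserve">'''Example A''' is &lt;b&gt;sample&lt;/b&gt; wikitext.</rev>
        </revisions>
      </page>
      <page pageid="2" ns="0" title="Example B">
        <revisions>
          <rev xml:space="preserve">Text of B</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>
XML;

$xml_reader = new XMLReader();
$xml_reader->XML($xml, "UTF-8");

$revisions = [];
$current_title = null;
while ($xml_reader->read()) {
    if ($xml_reader->nodeType !== XMLReader::ELEMENT) {
        continue; // only element start tags are of interest here
    }
    if ($xml_reader->name === "page") {
        // Remember which page the following <rev> belongs to.
        $current_title = $xml_reader->getAttribute("title");
    } elseif ($xml_reader->name === "rev" && $current_title !== null) {
        // readInnerXml() returns escaped markup, so decode the entities.
        $revisions[$current_title] =
            htmlspecialchars_decode($xml_reader->readInnerXml(), ENT_QUOTES);
    }
}
$xml_reader->close();

print_r($revisions);
```

XMLReader is a forward-only cursor, which is why the loop tracks the current page title in a variable instead of looking back up the tree.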

Edit: here is a working example:

<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");

// Advance the reader to the first <rev> inside the current <page>
// and return its decoded content.
function extract_first_rev(XMLReader $xml_reader)
{
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "rev") {
                // readInnerXml() returns escaped markup, so decode the entities.
                $content = htmlspecialchars_decode($xml_reader->readInnerXml(), ENT_QUOTES);
                return $content;
            }
        } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
            if ($xml_reader->name == "page") {
                // The page closed before any <rev> was seen.
                throw new Exception("Unexpectedly found `</page>`");
            }
        }
    }

    throw new Exception("Reached the end of the XML document without finding revision content");
}

// Map each page title to the wikitext of its latest revision.
$latest_rev = array();
while ($xml_reader->read()) {
    if ($xml_reader->nodeType == XMLReader::ELEMENT) {
        if ($xml_reader->name == "page") {
            $latest_rev[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
        }
    }
}

// Send wikitext to action=parse and return the rendered HTML.
function parse($rev)
{
    global $ch;

    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=parse&text=" . rawurlencode($rev) . "&prop=text&format=xml");
    sleep(3); // pause between requests
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");

    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "text") {
                $html = htmlspecialchars_decode($xml_reader->readInnerXml(), ENT_QUOTES);
                return $html;
            }
        }
    }

    throw new Exception("Failed to parse");
}

// Use a separate loop variable so the $latest_rev array is not overwritten.
foreach ($latest_rev as $title => $rev) {
    echo parse($rev) . "\n";
}
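
For comparison, the property-style access attempted in the question (`$xml->api->query->pages->page->rev`) needs a SimpleXML object rather than the raw string returned by `curl_exec`. The sketch below shows that style against a hand-written sample. It assumes the response has the shape `api > query > pages > page > revisions > rev`, in which case the root `<api>` element is the object itself and `<rev>` sits under `<revisions>`. The sample content is made up, and this is an alternative to the XMLReader approach, not the original answer's method.

```php
<?php
// Hand-written sample that mimics the assumed shape of the API response.
$xml = <<<XML
<api>
  <query>
    <pages>
      <page pageid="1" ns="0" title="Example A">
        <revisions>
          <rev xml:space="preserve">Text of A</rev>
        </revisions>
      </page>
      <page pageid="2" ns="0" title="Example B">
        <revisions>
          <rev xml:space="preserve">Text of B</rev>
        </revisions>
      </page>
    </pages>
  </query>
</api>
XML;

// Parse the string into a SimpleXMLElement; the returned object is <api>.
$doc = simplexml_load_string($xml);
if ($doc === false) {
    throw new Exception("Could not parse the XML");
}

$latest_rev = [];
foreach ($doc->query->pages->page as $page) {
    // Attributes use array syntax; casting to string extracts the text.
    $latest_rev[(string) $page['title']] = (string) $page->revisions->rev;
}

print_r($latest_rev);
```

SimpleXML loads the whole document into memory, so the streaming XMLReader loop is likely the better fit when responses are large.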

Regarding "php - How do I parse Wikipedia XML with PHP?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/4839938/
