PHP DOM UTF-8 problem

Reposted. Author: 行者123. Last updated: 2023-12-01 18:09:46

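First of all, my database uses Windows-1250 as its native charset, and I output the data as UTF-8. I use the iconv() function all over the website to convert Windows-1250 strings to UTF-8 strings, and that works perfectly.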
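
For reference, a minimal sketch of the kind of conversion described above. The sample word is an assumption made for illustration; it is written as raw Windows-1250 bytes so the example does not depend on how the script file itself is saved.

```php
<?php
// "Příliš" as Windows-1250 bytes (ř = 0xF8, í = 0xED, š = 0x9A).
$win1250 = "P\xF8\xEDli\x9A";

$utf8 = iconv('Windows-1250', 'UTF-8', $win1250);
if ($utf8 === false) {
    // iconv() returns false when the input cannot be converted.
    throw new RuntimeException('Conversion from Windows-1250 failed');
}

// Each accented letter now occupies two bytes in UTF-8.
echo bin2hex($utf8), "\n"; // 50c599c3ad6c69c5a1
```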

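The problem appears when I use PHP DOM to parse some HTML stored in the database. The HTML is the output of a WYSIWYG editor and is not a valid document: it has no html, head, or body tags.

The HTML might look like this, for example: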

<p>Hello</p>

This is the method I use to parse a given piece of HTML from the database:

private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

    $doc = new DOMDocument('1.0', 'UTF-8');

    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;

    foreach ($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }

    $html = trim($doc->saveHTML());

    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}

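The output of the method above is garbage: all special characters are replaced with strange-looking character sequences.

One more thing: it does work on my development server, but it does not work on the production server.

Any suggestions?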

PHP version on the production server: PHP Version 5.2.0RC4-dev

PHP version on the development server: PHP Version 5.2.13


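UPDATE:

I am working on a solution myself. I got the idea from this PHP bug report (which is not really a bug): http://bugs.php.net/bug.php?id=32547

This is the solution I came up with. I will try it tomorrow and let you know whether it works: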

private function ParseSlideContent($slideContent)
{
    var_dump(iconv('Windows-1250', 'UTF-8', $slideContent)); // this outputs the HTML ok with all special characters

    $doc = new DOMDocument('1.0', 'UTF-8');

    // hack to preserve UTF-8 characters
    $html = iconv('Windows-1250', 'UTF-8', $slideContent);
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $doc->preserveWhiteSpace = false;

    // this might work
    // it basically just adds head and meta tags to the document
    $html = $doc->getElementsByTagName('html')->item(0);
    $head = $doc->createElement('head', '');
    $meta = $doc->createElement('meta', '');
    $meta->setAttribute('http-equiv', 'Content-Type');
    $meta->setAttribute('content', 'text/html; charset=utf-8');
    $head->appendChild($meta);
    $body = $doc->getElementsByTagName('body')->item(0);
    $html->removeChild($body);
    $html->appendChild($head);
    $html->appendChild($body);

    foreach ($doc->getElementsByTagName('img') as $t) {
        $path = trim($t->getAttribute('src'));
        $t->setAttribute('src', '/clientarea/utils/locate-image?path=' . urlencode($path));
    }
    foreach ($doc->getElementsByTagName('object') as $o) {
        foreach ($o->getElementsByTagName('param') as $p) {
            $path = trim($p->getAttribute('value'));
            $p->setAttribute('value', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        }
    }
    foreach ($doc->getElementsByTagName('embed') as $e) {
        if (true === $e->hasAttribute('pluginspage')) {
            $path = trim($e->getAttribute('src'));
            $e->setAttribute('src', '/clientarea/utils/locate-flash?path=' . urlencode($path));
        } else {
            $path = end(explode('data/media/video/', trim($e->getAttribute('src'))));
            $path = 'data/media/video/' . $path;
            $path = '/clientarea/utils/locate-video?path=' . urlencode($path);
            $width = $e->getAttribute('width') . 'px';
            $height = $e->getAttribute('height') . 'px';
            $a = $doc->createElement('a', '');
            $a->setAttribute('href', $path);
            $a->setAttribute('style', "display:block;width:$width;height:$height;");
            $a->setAttribute('class', 'player');
            $e->parentNode->replaceChild($a, $e);
            $this->slideContainsVideo = true;
        }
    }

    $html = trim($doc->saveHTML());

    $html = explode('<body>', $html);
    $html = explode('</body>', $html[1]);
    return $html[0];
}

Best answer

Your "hack" makes no sense.

You are converting a Windows-1250 HTML file to UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. For HTML files, the DOM extension:

  • takes the charset specified in a meta http-equiv for "content-type";
  • otherwise assumes ISO-8859-1.
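
A minimal sketch of what such an ISO-8859-1 fallback does to UTF-8 input. The misreading is simulated here with iconv() rather than with the DOM extension itself, and the sample character is an assumption chosen for illustration.

```php
<?php
// "á" as UTF-8 bytes, written in hex so the file encoding does not matter.
$utf8 = "\xC3\xA1";

// Treat those two bytes as two ISO-8859-1 characters and re-encode them
// as UTF-8, which mimics a parser that was never told the input is UTF-8.
$misread = iconv('ISO-8859-1', 'UTF-8', $utf8);

echo bin2hex($utf8), "\n";    // c3a1
echo bin2hex($misread), "\n"; // c383c2a1, which displays as "Ã¡"
```

One character has turned into two, which is the typical shape of the garbage reported in the question.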

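I suggest you convert from Windows-1250 to ISO-8859-1 instead and don't prepend anything.

EDIT: That suggestion is not very good, because Windows-1250 contains characters that do not exist in ISO-8859-1. Since you are dealing with fragments that have no meta content-type element, you can add your own to force interpretation as UTF-8: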

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
"<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new DOMDocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

This gives:

string(79) "ĄĽź ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
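
The same meta-prefix approach could be wrapped in small helpers so that a fragment goes in and the contents of the body come out without splitting the serialized string on '<body>'. This is only a sketch under stated assumptions: loadUtf8Fragment() and bodyInnerHtml() are hypothetical names, the scalar type declarations need PHP 7 or later, and passing a node to saveHTML() needs PHP 5.3.6 or later, so it would not run as-is on the PHP 5.2 servers mentioned in the question.

```php
<?php
// Hypothetical helper: parse a UTF-8 HTML fragment with an explicit charset.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    // The prepended meta element tells the parser the input is UTF-8.
    $doc->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        . $fragment
    );

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

// Hypothetical helper: serialize only the children of <body>.
function bodyInnerHtml(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Usage: the fragment is "Příliš" in UTF-8, written as hex escapes.
$doc = loadUtf8Fragment("<p>P\xC5\x99\xC3\xADli\xC5\xA1</p>");
echo $doc->getElementsByTagName('p')->item(0)->textContent, "\n";
echo bodyInnerHtml($doc), "\n";
```

Whether the serialized markup contains raw UTF-8 or character entities may depend on the libxml2 version, so callers that compare strings may want to run the result through html_entity_decode() first.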

Regarding the PHP DOM UTF-8 problem, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3548880/
