php - How do I replace non-SGML characters in a string using PHP?

Reposted by: 太空狗 · Updated: 2023-10-29 15:06:27

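I wrote a guestbook using PHP4 and HTML 4.01 (with the charset ISO-8859-15, i.e. Latin-9). The data is stored in a MySQL database using the charset ISO-8859-1 (Latin-1).

When someone enters characters from a different character set, the browser seems to send the data encoded (I have not actually checked where the encoding takes place, ...).

Anyway, in some cases the characters seem to be stored in the database unencoded. So when I output that data in an HTML 4.01 document, the validator returns this error message: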

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I am now using PHP 5.2.17 and tried htmlspecialchars, but it had no effect. How can I encode these characters so that the validation error no longer occurs?

Best answer

In ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is a control character from the C1 range: PU2 (Private Use Two).

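SGML refers to ISO 8859-1 (note the space between ISO and 8859-1; it is not the hyphenated name of the charsets you use). It does not allow control characters, with three exceptions (here: SGML in HTML):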

In the HTML document character set only three control characters are allowed: HorizontalTab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

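So you are indeed passing an illegal character. There is no SGML/HTML entity that could be used to replace it.

I suggest you validate the input that enters your application so that it does not allow control characters. If you believe these characters originally represented something useful, such as letters that can actually be read (i.e. not control characters), then the encoding was probably broken at some point while you were processing the data.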
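The validation suggested above could look roughly like the following. This is a minimal sketch that assumes the input is a single-byte ISO-8859-x string (it would damage UTF-8 text, whose multi-byte sequences use bytes 128-191); the function names stripControlChars and hasControlChars are made up for illustration.

```php
<?php
/**
 * Remove control characters from a single-byte (ISO-8859-x) string.
 * Tab (9), line feed (10) and carriage return (13) are kept, since
 * those are the three control characters HTML allows.
 */
function stripControlChars($input)
{
    // C0 controls except \t \n \r, DEL (127) and the C1 range 128-159.
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', '', $input);
}

/**
 * Report whether a single-byte string contains any disallowed control byte,
 * e.g. to reject a guestbook entry instead of silently altering it.
 */
function hasControlChars($input)
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $input) === 1;
}
```

Whether to strip or to reject is a design choice: stripping loses whatever the byte was meant to be, so rejecting (or mapping, as discussed further down) is the gentler option when the bytes turn out to be Windows-1252 punctuation.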

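From the information given in your question it is hard to say where, because you only specify the input encoding and the encoding of the database storage, and those two already do not match (this should not produce the problem you are asking about, but it can produce other problems). Besides these two places, there are also the database client connection charset (not specified in your question), the output encoding (not specified) and the response content encoding (not specified).

It may make sense to change the overall encoding to UTF-8 to support a wider range of characters, but that is only a possibility.

Edit: The part above is a bit strict. It occurred to me that the input you receive is actually not ISO-8859-1(5) but something else, such as a Windows code page. My guess is Windows-1252 (cp1252) (see Wikipedia). Where ISO-8859-1 has its C1 range (128-159), Windows-1252 has several non-control characters.

The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252/CP1252/CP-1252. The PHP htmlentities() function cannot handle these characters; its translation table for HTML entities does not cover these code points (PHP 5.3, not tested against 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15: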

/*
 * mappings of Windows-1252 (cp1252)  128 (0x80) - 159 (0x9F) characters:
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',    # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;',  # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;',  # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;',  # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;',  # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',   # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&lsquo;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',    # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',   # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;',  # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',   # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&Yuml;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);
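As a usage sketch, the table might be combined with htmlspecialchars like this. The helper name escapeGuestbookText and the shortened three-entry table are assumptions for illustration only. Markup characters are escaped first so that the ampersands of the inserted entities are not escaped a second time, and the charset is passed explicitly because from PHP 5.4 on the default is UTF-8, under which a lone byte such as \x92 is rejected as invalid input.

```php
<?php
// Shortened, illustrative copy of the table above (three entries only).
$cp1252Subset = array(
    "\x85" => '&hellip;',  # 133 -> horizontal ellipsis
    "\x92" => '&rsquo;',   # 146 -> right single quotation mark
    "\x96" => '&ndash;',   # 150 -> en dash
);

// Hypothetical helper: escape markup first, then map the cp1252 bytes.
function escapeGuestbookText($text, array $table)
{
    // Explicit single-byte charset, so bytes 128-159 pass through untouched.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

    // Replace each remaining cp1252 byte with its entity.
    return strtr($escaped, $table);
}

echo escapeGuestbookText("Tom\x92s <b>note</b> \x85", $cp1252Subset);
// prints: Tom&rsquo;s &lt;b&gt;note&lt;/b&gt; &hellip;
```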

If you want to be even safer, you can skip the named entities and use only numeric entities, which also work in very old browsers:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;',   # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',    # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;',   # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;',   # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;',   # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;',   # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',    # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;',   # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',    # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;',   # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',    # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',    # 142 -> U+017D
    "\x91" => '&#8216;',   # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;',   # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;',   # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;',   # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;',   # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;',   # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',    # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;',   # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',    # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;',   # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',    # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',    # 158 -> U+017E
    "\x9F" => '&#376;',    # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);
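A table like the one above can also be derived instead of typed by hand. The following is only a sketch, assuming the iconv extension is installed and recognises the charset names CP1252 and UTF-32BE; the function name buildCp1252NumericEntities is hypothetical.

```php
<?php
/**
 * Build a byte => numeric entity map for the Windows-1252 range 0x80-0x9F.
 * Bytes that Windows-1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D)
 * are skipped.
 */
function buildCp1252NumericEntities()
{
    $table = array();
    for ($byte = 0x80; $byte <= 0x9F; $byte++) {
        $char = chr($byte);

        // One UTF-32BE character is exactly four bytes; false means unmapped.
        $utf32 = @iconv('CP1252', 'UTF-32BE', $char);
        if ($utf32 === false || strlen($utf32) !== 4) {
            continue;
        }

        // 'N' reads an unsigned 32-bit big-endian integer: the code point.
        $parts = unpack('N', $utf32);
        $codePoint = $parts[1];

        // Some iconv builds map undefined bytes straight onto C1 controls;
        // those would be illegal again as character references, so skip them.
        if ($codePoint < 0xA0) {
            continue;
        }
        $table[$char] = '&#' . $codePoint . ';';
    }
    return $table;
}

$numericEntities = buildCp1252NumericEntities();
```

The result is meant to be passed to strtr in the same way as the hand-written tables.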

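Hope this is more helpful now. See also the Wikipedia page linked above for some characters that appear in both windows-1252 and ISO 8859-15 but not in ISO 8859-1 (the euro sign, for example). You should probably consider using UTF-8 on your website.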
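If the site does move to UTF-8, existing Windows-1252 data has to be converted once. A minimal sketch, assuming the stored bytes really are Windows-1252 and that the iconv extension is available; cp1252ToUtf8 is an illustrative name:

```php
<?php
/**
 * Convert a Windows-1252 string to UTF-8.
 * Returns null when iconv cannot convert the input, for example because it
 * contains one of the bytes that Windows-1252 leaves undefined.
 */
function cp1252ToUtf8($input)
{
    $converted = @iconv('CP1252', 'UTF-8', $input);
    return $converted === false ? null : $converted;
}

// Byte 0x92 becomes U+2019, which UTF-8 encodes as E2 80 99.
echo bin2hex(cp1252ToUtf8("It\x92s"));
// prints: 4974e2809973
```

Converting the data is only one part of the move: the page charset, the database columns and the database connection charset would all need to agree on UTF-8 as well.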

Regarding "php - How do I replace non-SGML characters in a string using PHP?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/9736949/
