
php - How do I replace non-SGML characters in a string using PHP?

Reposted. Author: 太空狗. Updated: 2023-10-29 15:06:27

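I wrote a guestbook using PHP 4 and HTML 4.01 (with the character set ISO-8859-15, i.e. Latin-9). The data is stored in a MySQL database using the character set ISO-8859-1 (Latin-1).

When someone enters characters from a different character set, the browser seems to send the data encoded (I have not actually checked where the encoding happens, ...).

Anyway, in some cases the characters seem to be stored in the database unencoded. As a result, when I output the data in an HTML 4.01 document, the validator returns an error message: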

non SGML character number 146

You have used an illegal character in your text. HTML uses the standard UNICODE Consortium character repertoire, and it leaves undefined (among others) 65 character codes (0 to 31 inclusive and 127 to 159 inclusive) that are sometimes used for typographical quote marks and similar in proprietary character sets. The validator has found one of these undefined characters in your document. The character may appear on your browser as a curly quote, or a trademark symbol, or some other fancy glyph; on a different computer, however, it will likely appear as a completely different character, or nothing at all.

Your best bet is to replace the character with the nearest equivalent ASCII character, or to use an appropriate character entity. For more information on Character Encoding on the web, see Alan Flavell's excellent HTML Character Set Issues reference.

This error can also be triggered by formatting characters embedded in documents by some word processors. If you use a word processor to edit your HTML documents, be sure to use the "Save as ASCII" or similar command to save the document without formatting information.

I am now using PHP 5.2.17 and have tried htmlspecialchars, but it had no effect. How can I encode these characters so that the validation error no longer occurs?

Best answer

In ISO-8859-1 and ISO-8859-15, character number 146 (0x92) is the control character PU2 (Private Use Two) from the C1 range.

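SGML refers to ISO 8859-1 (note the space between ISO and 8859-1, rather than the hyphen in the name of the charset you are using). It allows no control characters except three (see: SGML in HTML):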

In the HTML document character set only three control characters are allowed: Horizontal Tab, Carriage Return, and Line Feed (code positions 9, 13, and 10).

So you are indeed passing an illegal character. There is no SGML/HTML entity that could be used to replace it.

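I suggest you validate the input coming into your application so that it does not allow control characters. If you believe these characters originally represented something useful, such as a letter that can actually be read (that is, not a control character), then the encoding was probably broken at some point while you processed the data.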
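The input validation suggested above could look roughly like the following. This is a minimal sketch, assuming the text is in a single-byte encoding such as ISO-8859-1 or ISO-8859-15; the function names `containsNonSgmlChars` and `stripNonSgmlChars` are hypothetical and not part of the original answer.

```php
<?php
// Sketch only: assumes $text is in a single-byte encoding (ISO-8859-1/-15).
// Do not run this on UTF-8 data, where bytes 0x80-0x9F are legitimate
// continuation bytes of multi-byte sequences.

// Hypothetical helper: true if $text contains a control character other
// than tab (9), line feed (10) and carriage return (13).
function containsNonSgmlChars($text)
{
    return preg_match('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', $text) === 1;
}

// Hypothetical helper: removes those characters from $text.
function stripNonSgmlChars($text)
{
    return preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F\x7F-\x9F]/', '', $text);
}
```

Whether to reject such input or silently strip it is a design decision; stripping loses information if the bytes were in fact Windows-1252 punctuation.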

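From the information given in your question it is hard to say where, because you only specify the input encoding and the database storage encoding, and those two already do not match (that should not cause the problem you are asking about, but it will cause other problems). Besides those two places, there are also the database client connection charset, the output encoding, and the response content encoding, none of which are specified in your question.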

Switching everything to UTF-8 to support a wider range of characters might make sense, but that really is only a "might".
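If the site were moved to UTF-8, the existing data would need a one-time conversion. The following is a minimal sketch, assuming the stored bytes are really Windows-1252 and that the iconv extension is available; the function name `cp1252ToUtf8` is hypothetical.

```php
<?php
// Hypothetical helper: converts a Windows-1252 string to UTF-8.
// iconv() returns false if it cannot convert the input, which may happen
// for bytes that Windows-1252 leaves undefined, depending on the iconv
// implementation in use.
function cp1252ToUtf8($text)
{
    return iconv('Windows-1252', 'UTF-8', $text);
}
```

With this, byte 0x92 (decimal 146) becomes U+2019, which UTF-8 encodes as the three bytes E2 80 99.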

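Edit: The part above is somewhat strict. It occurred to me that the input you receive may not actually be ISO-8859-1(5) but something else, such as a Windows code page. My guess is Windows-1252 (cp1252) (Wikipedia). Where ISO-8859-1 has the C1 control range (128-159), Windows-1252 has a number of non-control characters.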

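The Wikipedia page also notes that most browsers treat ISO-8859-1 as Windows-1252/CP1252/CP-1252. The PHP htmlentities() function cannot handle these characters; its translation table for HTML entities does not cover these code points (PHP 5.3, not tested with 5.4). You need to create your own translation table and use it with strtr to replace the Windows-1252 characters that are not available in ISO 8859-15: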

/*
 * mappings of Windows-1252 (cp1252) 128 (0x80) - 159 (0x9F) characters
 * to HTML 4.01 named entities (numeric where no named entity exists):
 * @link http://en.wikipedia.org/wiki/Windows-1252
 * @link http://www.w3.org/TR/html4/sgml/entities.html
 */
$cp1252HTML401Entities = array(
    "\x80" => '&euro;',   # 128 -> euro sign, U+20AC NEW
    "\x82" => '&sbquo;',  # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&fnof;',   # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&bdquo;',  # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&hellip;', # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&dagger;', # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&Dagger;', # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&circ;',   # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&permil;', # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&Scaron;', # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&lsaquo;', # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&OElig;',  # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',   # 142 -> latin capital letter Z with caron, U+017D (no named entity)
    "\x91" => '&lsquo;',  # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&rsquo;',  # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&ldquo;',  # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&rdquo;',  # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&bull;',   # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&ndash;',  # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&mdash;',  # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&tilde;',  # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&trade;',  # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&scaron;', # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&rsaquo;', # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&oelig;',  # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',   # 158 -> latin small letter z with caron, U+017E (no named entity)
    "\x9F" => '&Yuml;',   # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);

$outputWithEntities = strtr($output, $cp1252HTML401Entities);
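When such a table is combined with htmlspecialchars, the order of the two steps matters: escape the markup characters first and apply the table afterwards, otherwise the `&` of every inserted entity would be escaped a second time. The following is a minimal sketch with a shortened table. The function name `escapeCp1252ForHtml` is hypothetical, and passing 'ISO-8859-1' as the charset is an assumption made here so that htmlspecialchars leaves bytes 128-159 untouched on newer PHP versions, where the default charset is UTF-8.

```php
<?php
// Shortened table for illustration only; a full table needs all 27 mappings.
$cp1252Subset = array(
    "\x80" => '&euro;',
    "\x91" => '&lsquo;',
    "\x92" => '&rsquo;',
    "\x93" => '&ldquo;',
    "\x94" => '&rdquo;',
);

// Hypothetical helper: escapes markup, then maps Windows-1252 bytes.
function escapeCp1252ForHtml($text, $table)
{
    // Step 1: escape &, <, > and quotes. With a single-byte charset,
    // bytes 0x80-0x9F pass through unchanged.
    $escaped = htmlspecialchars($text, ENT_QUOTES, 'ISO-8859-1');

    // Step 2: replace the Windows-1252 bytes with their entities.
    return strtr($escaped, $table);
}
```

For example, `escapeCp1252ForHtml("Tom\x92s <b>", $cp1252Subset)` gives `Tom&rsquo;s &lt;b&gt;`.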

If you want to be on the safe side, you can skip the named entities and use only numeric entities, which also work in very old browsers:

$cp1252HTMLNumericEntities = array(
    "\x80" => '&#8364;', # 128 -> euro sign, U+20AC NEW
    "\x82" => '&#8218;', # 130 -> single low-9 quotation mark, U+201A NEW
    "\x83" => '&#402;',  # 131 -> latin small f with hook = function = florin, U+0192 ISOtech
    "\x84" => '&#8222;', # 132 -> double low-9 quotation mark, U+201E NEW
    "\x85" => '&#8230;', # 133 -> horizontal ellipsis = three dot leader, U+2026 ISOpub
    "\x86" => '&#8224;', # 134 -> dagger, U+2020 ISOpub
    "\x87" => '&#8225;', # 135 -> double dagger, U+2021 ISOpub
    "\x88" => '&#710;',  # 136 -> modifier letter circumflex accent, U+02C6 ISOpub
    "\x89" => '&#8240;', # 137 -> per mille sign, U+2030 ISOtech
    "\x8A" => '&#352;',  # 138 -> latin capital letter S with caron, U+0160 ISOlat2
    "\x8B" => '&#8249;', # 139 -> single left-pointing angle quotation mark, U+2039 ISO proposed
    "\x8C" => '&#338;',  # 140 -> latin capital ligature OE, U+0152 ISOlat2
    "\x8E" => '&#381;',  # 142 -> latin capital letter Z with caron, U+017D
    "\x91" => '&#8216;', # 145 -> left single quotation mark, U+2018 ISOnum
    "\x92" => '&#8217;', # 146 -> right single quotation mark, U+2019 ISOnum
    "\x93" => '&#8220;', # 147 -> left double quotation mark, U+201C ISOnum
    "\x94" => '&#8221;', # 148 -> right double quotation mark, U+201D ISOnum
    "\x95" => '&#8226;', # 149 -> bullet = black small circle, U+2022 ISOpub
    "\x96" => '&#8211;', # 150 -> en dash, U+2013 ISOpub
    "\x97" => '&#8212;', # 151 -> em dash, U+2014 ISOpub
    "\x98" => '&#732;',  # 152 -> small tilde, U+02DC ISOdia
    "\x99" => '&#8482;', # 153 -> trade mark sign, U+2122 ISOnum
    "\x9A" => '&#353;',  # 154 -> latin small letter s with caron, U+0161 ISOlat2
    "\x9B" => '&#8250;', # 155 -> single right-pointing angle quotation mark, U+203A ISO proposed
    "\x9C" => '&#339;',  # 156 -> latin small ligature oe, U+0153 ISOlat2
    "\x9E" => '&#382;',  # 158 -> latin small letter z with caron, U+017E
    "\x9F" => '&#376;',  # 159 -> latin capital letter Y with diaeresis, U+0178 ISOlat2
);
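Because each numeric entity is simply the decimal form of the Unicode code point noted in the comments, such a table could also be generated instead of typed out by hand. The following is a sketch with a shortened code point map; the names `$cp1252CodePoints` and `buildNumericEntityTable` are hypothetical.

```php
<?php
// Windows-1252 byte => Unicode code point (shortened for illustration).
$cp1252CodePoints = array(
    "\x80" => 0x20AC, // euro sign
    "\x92" => 0x2019, // right single quotation mark
    "\x93" => 0x201C, // left double quotation mark
    "\x94" => 0x201D, // right double quotation mark
    "\x99" => 0x2122, // trade mark sign
);

// Hypothetical helper: turns the code point map into a strtr() table
// of decimal numeric character references.
function buildNumericEntityTable($codePoints)
{
    $table = array();
    foreach ($codePoints as $byte => $codePoint) {
        $table[$byte] = sprintf('&#%d;', $codePoint);
    }
    return $table;
}
```

The resulting array can be passed to strtr in the same way as the hand-written tables.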

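I hope this is more helpful now. See also the Wikipedia page linked above for some of the characters in Windows-1252 and ISO 8859-15. You should probably consider using UTF-8 on your website.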

Regarding "php - How do I replace non-SGML characters in a string using PHP?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9736949/
