
unicode - Why is Unicode restricted to 0x10FFFF?

Reposted. Author: 行者123. Updated: 2023-12-04 06:11:07

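Why is the maximum Unicode code point restricted to 0x10FFFF? Is it possible to represent Unicode above this code point (for example 0x10FFFF + 0x000001 = 0x110000) through any encoding scheme such as UTF-16 or UTF-8?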

Best answer

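This is because of UTF-16. In UTF-16, characters outside the BMP are represented using a surrogate pair, whose first code unit lies in 0xD800–0xDBFF and whose second code unit lies in 0xDC00–0xDFFF. Each code unit carries 10 bits of the code point, giving 20 bits of data in total (0x100000 characters), divided into 16 planes (16 × 2^16 characters). The BMP itself accounts for the remaining 0x10000 characters (code points 0–0xFFFF).
So the total number of characters is 0x100000 + 0x10000 = 0x110000, which allows code points from 0 to 0x110000 - 1 = 0x10FFFF. Alternatively, the last representable code point can be computed this way: code points in the BMP lie in the range 0–0xFFFF, so characters encoded with surrogate pairs start at an offset of 0xFFFF + 1 = 0x10000, which means the last code point a surrogate pair can represent is 0xFFFFF + 0x10000 = 0x10FFFF.
The Unicode Character Encoding Stability Policies guarantee that code points above that will never be assigned: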

The General_Category property value Surrogate (Cs) is immutable: the set of code points with that value will never change.
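
The surrogate-pair arithmetic described above can be sketched in a few lines of Python. This is only an illustration: the helper names `to_surrogates` and `from_surrogates` are made up for this example, and only the ranges and the 0x10000 offset come from the answer itself.

```python
# Surrogate ranges and offset as given in the answer; helper names are hypothetical.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF
OFFSET = 0x10000  # first code point outside the BMP


def to_surrogates(cp):
    """Split a code point in 0x10000..0x10FFFF into a (high, low) pair."""
    if not OFFSET <= cp <= 0x10FFFF:
        raise ValueError("code point cannot be encoded as a surrogate pair")
    value = cp - OFFSET  # 20-bit value in 0..0xFFFFF
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return HIGH_START + (value >> 10), LOW_START + (value & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (HIGH_START <= high <= HIGH_END and LOW_START <= low <= LOW_END):
        raise ValueError("not a valid surrogate pair")
    return OFFSET + ((high - HIGH_START) << 10) + (low - LOW_START)


# The largest pair carries 20 one-bits (0xFFFFF), so the maximum is
# 0xFFFFF + 0x10000 = 0x10FFFF.
max_code_point = from_surrogates(HIGH_END, LOW_END)
print(hex(max_code_point))
```

Because the largest pair (0xDBFF, 0xDFFF) already uses all 20 available bits, there is no pair left over that could stand for 0x110000.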


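Historically, UTF-8 allowed up to U+7FFFFFFF using 6 bytes, while UTF-32 can store twice that number. However, because of the UTF-16 limitation, the Unicode committee decided that UTF-8 may never be longer than 4 bytes, which gives it the same range as UTF-16.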

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

https://en.wikipedia.org/wiki/UTF-8#History
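
As a rough check of the figures in the quote, the following sketch compares the historical 6-byte scheme with the 4-byte scheme of RFC 3629. The function name `utf8_length` and its `historical` flag are hypothetical, introduced only for this illustration; the range limits follow from the number of payload bits each sequence length carries.

```python
def utf8_length(cp, historical=False):
    """Number of bytes needed to encode cp in UTF-8.

    historical=True uses the pre-RFC 3629 scheme of up to 6 bytes.
    """
    if historical:
        limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    else:
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogates are not valid in UTF-8")
        limits = [0x7F, 0x7FF, 0xFFFF, 0x10FFFF]
    for length, limit in enumerate(limits, start=1):
        if 0 <= cp <= limit:
            return length
    raise ValueError("code point out of range")


# Share of three-byte sequences (0x800..0xFFFF) lost to the surrogate range.
three_byte_total = 0xFFFF - 0x800 + 1
surrogate_share = (0xDFFF - 0xD800 + 1) / three_byte_total

# Share of four-byte sequences (0x10000..0x1FFFFF) lost by stopping at 0x10FFFF.
four_byte_total = 0x1FFFFF - 0x10000 + 1
removed_share = (0x1FFFFF - 0x10FFFF) / four_byte_total

print(f"{surrogate_share:.2%} of three-byte sequences removed")
print(f"{removed_share:.2%} of four-byte sequences removed")
```

Python's own encoder follows the restricted form: `chr(0x10FFFF).encode("utf-8")` yields four bytes, while `chr(0x110000)` raises `ValueError`.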


The same applies to UTF-32

In November 2003, Unicode was restricted by RFC 3629 to match the constraints of the UTF-16 encoding: explicitly prohibiting code points greater than U+10FFFF (and also the high and low surrogates U+D800 through U+DFFF). This limited subset defines UTF-32

https://en.wikipedia.org/wiki/UTF-32
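
A 32-bit code unit has room for values well above 0x10FFFF, so the limit in UTF-32 is a rule rather than a storage constraint. The sketch below illustrates this with CPython's built-in `utf-32-le` codec; the helper names `is_unicode_scalar_value` and `decodes_as_utf32` are hypothetical and exist only for this example.

```python
import struct


def is_unicode_scalar_value(cp):
    """True if cp is at most 0x10FFFF and is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF


def decodes_as_utf32(cp):
    """Pack cp as one little-endian 32-bit unit and try to decode it."""
    raw = struct.pack("<I", cp)
    try:
        raw.decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf32(0x10FFFF))  # fits the restricted range
print(decodes_as_utf32(0x110000))  # fits in 32 bits, but is rejected
```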


You can read this more detailed answer, along with these related questions:
  • Do UTF-8, UTF-16, and UTF-32 differ in the number of characters they can store?
  • Does the Unicode Consortium Intend to make UTF-16 run out of characters?
  • How many characters can be mapped with Unicode?
  • Proposal to restrict the range of code positions to the values up to U-0010FFFF
  • About "unicode - Why is Unicode restricted to 0x10FFFF?": we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52203351/
