
unicode - Are 6-octet UTF-8 sequences valid?

Reposted. Author: 行者123. Updated: 2023-12-04 06:27:53

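Can UTF-8 use 5- or 6-byte sequences so that every Unicode character can be encoded? The standards seem to contradict each other. I need to support every Unicode character, not just those in the U+0000..U+10FFFF range.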

(All quotes are from RFC 3629.)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
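
To make the quoted bit layout concrete, here is a minimal Python sketch of an encoder that follows it. The function name `encode_utf8` is made up for illustration, and the rejection of surrogates (U+D800..U+DFFF) comes from elsewhere in RFC 3629, not from the paragraph quoted above. In real code you would simply call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point with the 1-to-4-octet layout quoted above.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encodable in UTF-8")

    # One octet: high bit 0, remaining 7 bits carry the character number.
    if cp < 0x80:
        return bytes([cp])

    # Pick the sequence length n from the size of the code point.
    if cp < 0x800:
        n = 2
    elif cp < 0x10000:
        n = 3
    else:
        n = 4

    # Lead octet: n high-order bits set to 1, then a 0, then payload bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    out = [lead_marker | (cp >> (6 * (n - 1)))]

    # Continuation octets: bits 10xxxxxx, six payload bits each.
    for i in range(n - 2, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))
    return bytes(out)
```

For any non-surrogate code point in range, the result should match Python's built-in encoder, for example `encode_utf8(0x20AC) == "€".encode("utf-8")`.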



So not every possible character can be encoded in UTF-8? Does that mean I cannot encode characters from planes other than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.



Does this mean we cannot encode values as UTF-8 using 5 or 6 octets (or even using some 4-octet sequences that fall outside the range above)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).



Looking at the previous RFC confirms this: they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.
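
As a rough check of how a strict decoder treats the longer forms mentioned in this quote, the sketch below feeds Python's built-in UTF-8 codec a 6-byte sequence, a 5-byte sequence, and a 4-byte sequence that would decode to U+110000. The byte values were built by hand from the old extended layout, and `is_valid_utf8` and `max_utf8_bytes` are names made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def max_utf8_bytes(code_point_count: int) -> int:
    """Worst-case buffer size when code points are limited to U+10FFFF."""
    return 4 * code_point_count


# U+7FFFFFFF in the old 6-byte form (lead octet 0xFD).
legacy_six = bytes([0xFD, 0xBF, 0xBF, 0xBF, 0xBF, 0xBF])
# U+200000 in the old 5-byte form (lead octet 0xF8).
legacy_five = bytes([0xF8, 0x88, 0x80, 0x80, 0x80])
# A 4-byte sequence that would decode to U+110000, just past the limit.
over_range_four = bytes([0xF4, 0x90, 0x80, 0x80])
# U+10FFFF, the highest code point, in its 4-byte form.
highest_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])

for name, seq in [
    ("6-byte", legacy_six),
    ("5-byte", legacy_five),
    ("4-byte over range", over_range_four),
    ("U+10FFFF", highest_valid),
]:
    print(name, is_valid_utf8(seq))
```

Only the last sequence is accepted. With the range capped at U+10FFFF, sizing a buffer at 4 bytes per code point is sufficient.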



So these sequences are allowed by the ISO/IEC 10646 definition but not by the RFC 3629 definition? Which one should I follow?

Thanks in advance.

Best answer

There are no Unicode characters above U+10FFFF; the BMP covers only U+0000 through U+FFFF.

UTF-8 is well defined for the whole range U+0000 to U+10FFFF.
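
A quick way to see this in practice, sketched in Python 3 (where `sys.maxunicode` is 0x10FFFF): characters outside the BMP encode fine in 4 octets, and nothing above U+10FFFF exists to be encoded. `utf8_length` is a helper name made up for this example, and it must not be called with a surrogate code point (U+D800..U+DFFF), which the encoder rejects.

```python
import sys


def utf8_length(cp: int) -> int:
    """Number of octets Python's UTF-8 encoder emits for one code point.

    Hypothetical helper; not valid for surrogates (U+D800..U+DFFF).
    """
    return len(chr(cp).encode("utf-8"))


# The highest code point Python (and Unicode) knows about.
print(hex(sys.maxunicode))

# Last BMP code point, first supplementary one, an emoji, and the maximum.
for cp in (0xFFFF, 0x10000, 0x1F600, 0x10FFFF):
    print(f"U+{cp:04X}: {utf8_length(cp)} octets")

# There is no character to encode past U+10FFFF.
try:
    chr(0x110000)
except ValueError as exc:
    print("rejected:", exc)
```

The BMP needs at most 3 octets per character, and the other planes up to U+10FFFF need exactly 4, so 5- and 6-octet sequences are never required.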

This post is based on the Stack Overflow question "unicode - Are 6-octet UTF-8 sequences valid?": https://stackoverflow.com/questions/3559161/
