
unicode - Why were the code points in the range U+D800 to U+DFFF removed from the Unicode character set?

Repost. Author: 行者123. Updated: 2023-12-04 22:57:12

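I'm learning about the UTF-16 encoding, and I've read that to represent a code point in the range U+10000 to U+10FFFF you have to use a surrogate pair, whose two code units come from the range U+D800 to U+DFFF.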

So suppose I want to encode the following code point: U+10123 (10000000100100011 in binary):

First, I lay out the bits in the following pattern:

110110xxxxxxxxxx 110111xxxxxxxxxx

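Then I subtract 0x10000 from the code point, which leaves 0x0123 (00000000000100100011 as a 20-bit value), and fill the x positions with those 20 bits:

1101100000000000 1101110100100011 (D800 DD23 in hexadecimal)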
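The bit layout above can be sketched in a few lines of Python. This is only an illustration: the function name `to_surrogate_pair` is made up for this example, and the result is cross-checked against Python's built-in `utf-16-be` codec. Note that 0x10000 is subtracted from the code point before the remaining 20 bits are split into two groups of 10.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into UTF-16 surrogates.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are encoded as a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value in 0x00000..0xFFFFF
    high = 0xD800 | (offset >> 10)     # 110110 followed by the top 10 bits
    low = 0xDC00 | (offset & 0x3FF)    # 110111 followed by the bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10123)
print(f"{high:04X} {low:04X}")  # prints "D800 DD23"

# Cross-check against Python's built-in UTF-16 big-endian encoder.
assert chr(0x10123).encode("utf-16-be") == bytes.fromhex("D800DD23")
```

Because the offset is at most 0xFFFFF, the high unit always lands in D800..DBFF and the low unit in DC00..DFFF.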

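I've also read that the code points in the range U+D800 to U+DFFF were removed from the Unicode character set, but I don't understand why this range was removed!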

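I mean, this range could easily be encoded in 4 bytes. For example, padding the code point U+D812 (1101100000010010 in binary) to 20 bits and filling the same pattern would give:

1101100000110110 1101110000010010 (D836 DC12 in hexadecimal)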

Note: I'm using UTF-16 Big Endian in my examples.

Best answer

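The code points U+D800 to U+DFFF are reserved exclusively for use by UTF-16. Since they fall outside the range U+10000 to U+10FFFF, UTF-16 would not encode them with surrogate pairs; each would have to appear as a single 16-bit code unit, where it would be indistinguishable from one half of a surrogate pair. These code points therefore cannot validly appear on their own in a UTF-16 sequence.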
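As a rough illustration of this point, the sketch below uses Python's built-in codecs. It shows that the byte sequence D836 DC12 proposed in the question already decodes to a different, supplementary code point, and that strict encoders refuse a lone U+D812. The helper name `can_encode` is an assumption made for this example.

```python
# The four bytes proposed in the question already have a meaning in UTF-16.
decoded = bytes.fromhex("D836DC12").decode("utf-16-be")
print(f"U+{ord(decoded):04X}")  # prints "U+1D812", not U+D812


def can_encode(code_point: int, encoding: str) -> bool:
    """Return True if the code point can be encoded with Python's strict codec.

    Hypothetical helper for illustration only.
    """
    try:
        chr(code_point).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# A lone surrogate is rejected by the strict UTF-16 and UTF-8 encoders.
print(can_encode(0xD812, "utf-16-be"))   # False
print(can_encode(0xD812, "utf-8"))       # False
# The supplementary code point that D836 DC12 really stands for is fine.
print(can_encode(0x1D812, "utf-16-be"))  # True
```

Python can hold a lone surrogate in a `str` object, but its default (strict) error handling refuses to serialize it, which matches the rule that such code points are not valid in any UTF.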

According to the Unicode.org UTF-16 FAQ:

1: Q: What are surrogates?

A: Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from D800₁₆ to DBFF₁₆, and trailing, or low, surrogates are from DC00₁₆ to DFFF₁₆. They are called surrogates, since they do not represent characters directly, but only as a pair.



2: Q: Are there any 16-bit values that are invalid?

A: Unpaired surrogates are invalid in UTFs. These include any value in the range D800₁₆ to DBFF₁₆ not followed by a value in the range DC00₁₆ to DFFF₁₆, or any value in the range DC00₁₆ to DFFF₁₆ not preceded by a value in the range D800₁₆ to DBFF₁₆.
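The pairing rule quoted above can be written as a small check over a list of 16-bit code units. This is a minimal sketch, not a full UTF-16 validator, and the function name `unpaired_surrogates` is invented for this example.

```python
from typing import List


def unpaired_surrogates(units: List[int]) -> List[int]:
    """Return the indexes of surrogate code units that are not part of a valid pair.

    Hypothetical helper for illustration only.
    """
    bad = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:  # leading (high) surrogate
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2                # well-formed pair: skip both units
                continue
            bad.append(i)             # high surrogate with no trailing unit
        elif 0xDC00 <= unit <= 0xDFFF:
            bad.append(i)             # low surrogate not preceded by a high one
        i += 1
    return bad


print(unpaired_surrogates([0xD800, 0xDD23]))  # prints "[]" (a valid pair)
print(unpaired_surrogates([0x0041, 0xD812]))  # prints "[1]" (lone high surrogate)
print(unpaired_surrogates([0xDC12, 0xD836]))  # prints "[0, 1]" (wrong order)
```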

For more on "unicode - Why were the code points in the range U+D800 to U+DFFF removed from the Unicode character set?", see the corresponding question on Stack Overflow: https://stackoverflow.com/questions/40184882/
