
utf-8 - What is the rationale for UTF-8 storing the code point directly?

Reposted. Author: 行者123. Updated: 2023-12-02 02:43:42

UTF-8 stores the significant bits of the code point in the low bits of the code units:

U+0000-U+007F       0xxxxxxx
U+0080-U+07FF 110xxxxx 10xxxxxx
U+0800-U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
U+10000-U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
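The bit layout in the table can be sketched as a small encoder. This is only an illustration under stated assumptions: the function name `utf8_encode` is made up for this example, and the surrogate check reflects the rule that surrogates are not valid in UTF-8.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the bit layout from the table (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:                      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                     # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in codec for one code point per row of the table.
for cp in (0x24, 0xA2, 0x20AC, 0x10348):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```

Each branch copies the code point's bits straight into the `x` positions, which is what "storing the code point directly" refers to.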

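This layout requires decoders to check for overlong sequences (such as C0 80 instead of 00) and also reduces the number of code points that can be encoded in a given number of bytes. Suppose it used the same encoding but mapped code points like this:
  • The first 128 code points (U+0000-U+007F): 1 byte
  • The next 2048 code points (U+0080-U+087F): 2 bytes. E.g. C0 81 : U+0081
  • The next 65536 code points (U+0880-U+1087F): 3 bytes. E.g. E0 80 81 : U+0881
  • The remaining code points (U+10880-U+10FFFF; the 21 payload bits could reach U+21087F): 4 bytes. E.g. F0 80 80 81 : U+10881

(i.e. the value encodes the offset from the start of the range)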

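With such a mapping, more characters could be encoded with shorter sequences. Decoding might also be faster, because it only needs to add a constant, which is generally cheaper than a branch that checks for overlong sequences. In fact, 2048 more characters could be squeezed into 3 bytes if the surrogate range were removed from the mapping.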
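To make the comparison concrete, here is a minimal sketch of both decoders. The names `payload`, `decode_standard`, `decode_biased`, `BIAS` and `MIN_STANDARD` are hypothetical. The biased scheme is the one proposed in the question, not anything real UTF-8 does. Structural validation (lead and continuation byte patterns, surrogates, the U+10FFFF limit) is left out to keep the difference visible.

```python
# First code point of each length class in the hypothetical offset-based scheme.
BIAS = {1: 0x0000, 2: 0x0080, 3: 0x0880, 4: 0x10880}
# Smallest code point that standard UTF-8 allows for each sequence length.
MIN_STANDARD = {1: 0x0000, 2: 0x0080, 3: 0x0800, 4: 0x10000}


def payload(seq: bytes) -> int:
    """Concatenate the 'x' bits of a 1-4 byte sequence (byte patterns are not validated)."""
    n = len(seq)
    if n == 1:
        return seq[0] & 0x7F
    value = seq[0] & (0xFF >> (n + 1))  # n=2 -> 0x1F, n=3 -> 0x0F, n=4 -> 0x07
    for b in seq[1:]:
        value = (value << 6) | (b & 0x3F)
    return value


def decode_standard(seq: bytes) -> int:
    """Standard UTF-8: the payload is the code point, so overlong forms must be rejected."""
    value = payload(seq)
    if value < MIN_STANDARD[len(seq)]:
        raise ValueError("overlong sequence")
    return value


def decode_biased(seq: bytes) -> int:
    """Hypothetical scheme: the payload is an offset, so no overlong form can exist."""
    return payload(seq) + BIAS[len(seq)]


assert decode_standard(bytes.fromhex("E282AC")) == 0x20AC   # the euro sign
assert decode_biased(bytes.fromhex("C081")) == 0x0081
assert decode_biased(bytes.fromhex("E08081")) == 0x0881
assert decode_biased(bytes.fromhex("F0808081")) == 0x10881
```

In `decode_biased` the range check disappears and an addition takes its place, which is the trade the question describes. `decode_standard` would raise `ValueError` for `C0 80`, as Python's own codec does for `b"\xc0\x80".decode("utf-8")`.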

So why does UTF-8 store the code point that way?

Best answer

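The rationale is well documented in the "placemat" anecdote, which tells how Ken Thompson and Rob Pike worked out the specification on a placemat in a diner after the Unicode people (actually someone from X/Open) contacted them to review a draft.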

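http://doc.cat-v.org/bell_labs/utf-8_history contains Rob Pike's own account, along with the correspondence between him, Ken Thompson, and the X/Open people. It names this requirement as one of the crucial pieces missing from the earlier draft: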

    the ability to synchronize a byte stream picked up mid-run, with less that one character being consumed before synchronization



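In other words, when you look at a byte with the high bit set, you can tell from that byte's value alone whether you are in the middle of a UTF-8 sequence (continuation bytes always match 10xxxxxx). If you are, you rewind at most three bytes, until you reach a byte that is not a continuation byte, to find the start of the multi-byte character.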
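The resynchronization step can be sketched in a few lines. The helper names `is_continuation` and `char_start` are made up for this example, and the buffer is assumed to hold valid UTF-8.

```python
def is_continuation(b: int) -> bool:
    """True for bytes of the form 10xxxxxx, which never start a character."""
    return (b & 0xC0) == 0x80


def char_start(buf: bytes, i: int) -> int:
    """Return the index of the first byte of the character that contains buf[i].

    Assumes buf is valid UTF-8, so at most three continuation bytes are skipped.
    """
    steps = 0
    while i > 0 and is_continuation(buf[i]) and steps < 3:
        i -= 1
        steps += 1
    return i


buf = "a€b".encode("utf-8")      # bytes 61 E2 82 AC 62
start = char_start(buf, 3)       # index 3 is the last byte of the euro sign
assert start == 1
assert buf[start:start + 3].decode("utf-8") == "€"
```

No table lookup or knowledge of earlier bytes is needed, because a lead byte (`0xxxxxxx` or `11xxxxxx`) can never be confused with a continuation byte.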

The full story is well worth a read, so I will only summarize it briefly here. Below is an abridged version of part of the history section of the Wikipedia article.

    By early 1992, the search was on for a good byte-stream encoding of multi-byte character sets. The draft ISO 10646 standard contained a non-required annex called UTF-1 that provided a byte stream encoding of its 32-bit code points. This encoding was not satisfactory on performance grounds, among other problems, and the biggest problem was probably that it did not have a clear separation between ASCII and non-ASCII ...

    In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one that had faster implementation characteristics and introduced the improvement that 7-bit ASCII characters would only represent themselves; all multi-byte sequences would include only bytes where the high bit was set. ...

    In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it somewhat less bit-efficient than the previous proposal but crucially allowed it to be self-synchronizing, letting a reader start anywhere and immediately detect byte sequence boundaries. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.

Regarding "utf-8 - What is the rationale for UTF-8 storing the code point directly?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/57431095/
