C programming: How to program for Unicode?

Reposted by: 行者123 Updated: 2023-11-30 16:52:17

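What prerequisites are needed to do strict Unicode programming?

Does this imply that my code should not use the char type anywhere, and that I need to use functions that can deal with wint_t and wchar_t?

What role do multibyte character sequences play in this scenario?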

Best answer

C99 or earlier

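The C standard (C99) provides for wide characters and multibyte characters, but since there is no guarantee about what those wide characters can hold, their usefulness is somewhat limited. For a given implementation they provide useful support, but if your code must be able to move between implementations, there is insufficient guarantee that they will be useful.

Consequently, the approach suggested by Hans van Eck (which is to write a wrapper around the ICU - International Components for Unicode - library) is sound, IMO.

The UTF-8 encoding has many merits, one of which is that if you do not mess with the data (by truncating it, for example), then it can be copied by functions that are not fully aware of the intricacies of UTF-8 encoding. This is categorically not the case with wchar_t.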
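As a minimal sketch of that point: the helper below (the name copy_utf8 is made up for illustration) duplicates a UTF-8 string using only strlen and memcpy, neither of which knows anything about UTF-8. Because the whole byte sequence is copied and nothing is truncated, the result is byte-for-byte identical to the input.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: duplicate a NUL-terminated UTF-8 string using only
 * byte-oriented functions. Nothing here inspects the encoding. */
char *copy_utf8(const char *src)
{
    assert(src != NULL);
    size_t n = strlen(src) + 1;   /* counts bytes, not characters */
    char *dst = malloc(n);
    if (dst != NULL)
        memcpy(dst, src, n);      /* multi-byte sequences are copied intact */
    return dst;                   /* caller must free() the result */
}
```

The danger only starts when code cuts the data at an arbitrary byte offset, for example by copying into a buffer that is too small, since that can split a multi-byte sequence in the middle.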

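Unicode in full is a 21-bit format. That is, Unicode reserves code points from U+0000 to U+10FFFF.

One of the useful things about the UTF-8, UTF-16 and UTF-32 formats (where UTF stands for Unicode Transformation Format - see Unicode) is that you can convert between the three representations without loss of information. Each can represent anything the others can represent. Both UTF-8 and UTF-16 are variable-width formats: a single character may occupy more than one code unit.

UTF-8 is well known to be a multi-byte format, with a careful structure that makes it possible to find the start of a character reliably from any point in a string. Single-byte characters have the high bit set to zero. Multi-byte characters have a first byte starting with one of the bit patterns 110, 1110 or 11110 (for 2-byte, 3-byte or 4-byte characters), and subsequent bytes always start with 10. The continuation bytes are therefore always in the range 0x80 .. 0xBF. There are rules that UTF-8 characters must be represented in the shortest possible format. One consequence of these rules is that the bytes 0xC0 and 0xC1 (and also 0xF5 .. 0xFF) cannot appear in valid UTF-8 data.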

 U+0000 ..   U+007F  1 byte   0xxx xxxx
 U+0080 ..   U+07FF  2 bytes  110x xxxx   10xx xxxx
 U+0800 ..   U+FFFF  3 bytes  1110 xxxx   10xx xxxx   10xx xxxx
U+10000 .. U+10FFFF  4 bytes  1111 0xxx   10xx xxxx   10xx xxxx   10xx xxxx
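The table can be turned into code fairly directly. The sketch below is illustrative only: the names utf8_encode and utf8_char_start are invented here, and the encoder additionally rejects the surrogate range U+D800 .. U+DFFF, which is not valid in UTF-8. The second function shows the resynchronization property: from any byte index, stepping back over bytes of the form 10xx xxxx reaches the first byte of the character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written, or 0 if cp cannot be encoded
 * (a surrogate, or a value beyond U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp <= 0x7F) {                       /* 0xxx xxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 110x xxxx  10xx xxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 1110 xxxx + 2 continuation bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 1111 0xxx + 3 continuation bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper: given any byte index pos into UTF-8 data s, return
 * the index of the first byte of the character that contains it, by
 * skipping backwards over continuation bytes (10xx xxxx). */
size_t utf8_char_start(const unsigned char *s, size_t pos)
{
    assert(s != NULL);
    while (pos > 0 && (s[pos] & 0xC0) == 0x80)
        pos--;
    return pos;
}
```

For instance, U+20AC (the euro sign) falls in the 3-byte row and encodes as 0xE2 0x82 0xAC.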

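Originally, it was hoped that Unicode would be a 16-bit code set and that everything would fit into a 16-bit code space. Unfortunately, the real world is more complex, and it had to be expanded to the current 21-bit encoding.

UTF-16 is thus a single-unit (16-bit word) code set for the "Basic Multilingual Plane", meaning the characters with Unicode code points U+0000 .. U+FFFF, but it uses two units (32 bits) for characters outside this range. Code that works with the UTF-16 encoding must therefore be able to handle a variable-width encoding, just as with UTF-8. The codes used for the double-unit characters are called surrogates.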

Surrogates are code points from two special ranges of Unicode values, reserved for use as the leading, and trailing values of paired code units in UTF-16. Leading, also called high, surrogates are from U+D800 to U+DBFF, and trailing, or low, surrogates are from U+DC00 to U+DFFF. They are called surrogates, since they do not represent characters directly, but only as a pair.
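A short sketch of how a surrogate pair is built and taken apart may help. The function names utf16_encode and utf16_combine are invented for this example. The arithmetic is the standard one: subtract 0x10000 from the code point, which leaves 20 bits, then put the top 10 bits in the high surrogate and the bottom 10 bits in the low surrogate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16 code units.
 * Returns 1 or 2 (units written), or 0 if cp is a surrogate or is
 * beyond U+10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp <= 0xFFFF) {                     /* BMP: a single unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                          /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}

/* Hypothetical helper: combine a valid high/low surrogate pair into the
 * code point it represents. */
uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + (((uint32_t)(high - 0xD800) << 10)
                      | (uint32_t)(low - 0xDC00));
}
```

With these definitions, U+1F600 becomes the pair 0xD83D 0xDE00, and combining that pair gives U+1F600 back, which is the lossless round trip mentioned earlier.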

UTF-32, of course, can encode any Unicode code point in a single unit of storage. It is efficient for computation but not for storage.
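To put rough numbers on the storage cost, here is a small sketch (the three function names are made up for illustration) that returns how many bytes one code point occupies in each encoding form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers: bytes needed to store one code point. */
size_t utf8_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp <= 0x7F ? 1 : cp <= 0x7FF ? 2 : cp <= 0xFFFF ? 3 : 4;
}

size_t utf16_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp <= 0xFFFF ? 2 : 4;   /* one unit in the BMP, else a pair */
}

size_t utf32_size(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return 4;                      /* always one 32-bit unit */
}
```

Plain ASCII text therefore takes four times as much room in UTF-32 as in UTF-8. In exchange, the n-th code point of a UTF-32 string is simply element n of the array, with no scanning.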

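You can find more information on the ICU and Unicode websites.

C11 and `<uchar.h>`

The C11 standard changed the rules, but not all implementations have caught up with the changes even now (mid-2017). The C11 standard summarizes the changes for Unicode support as: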

Unicode characters and strings (<uchar.h>) (originally specified in ISO/IEC TR 19769:2004)

What follows is a bare minimal outline of the functionality. The specification includes:

6.4.3 Universal character names

Syntax
universal-character-name:
    \u hex-quad
    \U hex-quad hex-quad
hex-quad:
    hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
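As an illustration of that syntax, the sketch below spells the same text twice: once with universal character names inside a C11 u8 string literal, and once as explicit UTF-8 bytes. The identifiers are invented for the example. It assumes a C11 compiler; the static_assert on the U'...' constant additionally assumes the implementation defines __STDC_UTF_32__, so that char32_t values are UTF-32 code points.

```c
#include <assert.h>
#include <string.h>

/* \u takes exactly four hex digits, \U takes exactly eight. */
static const char ucn_text[] = u8"caf\u00E9 \U0001F600";
static const char raw_text[] = "caf\xC3\xA9 \xF0\x9F\x98\x80";

/* Assumes __STDC_UTF_32__: the constant's value is the code point itself. */
static_assert(U'\U0001F600' == 0x1F600, "char32_t constants are UTF-32");

/* Hypothetical helper: returns 1 if both spellings produce the same bytes. */
int ucn_matches_raw(void)
{
    return sizeof ucn_text == sizeof raw_text
        && memcmp(ucn_text, raw_text, sizeof raw_text) == 0;
}
```

The u8 prefix matters here: without it, the bytes produced for \u00E9 depend on the implementation's execution character set.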

7.28 Unicode utilities <uchar.h>

The header <uchar.h> declares types and functions for manipulating Unicode characters.

The types declared are mbstate_t (described in 7.29.1) and size_t (described in 7.19);
char16_t
which is an unsigned integer type used for 16-bit characters and is the same type as uint_least16_t (described in 7.20.1.2); and
char32_t
which is an unsigned integer type used for 32-bit characters and is the same type as uint_least32_t (also described in 7.20.1.2).
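A brief sketch of the two types in use, assuming an implementation that ships <uchar.h> and defines __STDC_UTF_16__ and __STDC_UTF_32__ (so that u"..." and U"..." literals are UTF-16 and UTF-32 respectively). The helper names u16_units and u32_units are made up for this example; the standard header provides no such length functions.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>

/* U+1F600 lies outside the BMP: two UTF-16 units, one UTF-32 unit. */
static const char16_t grin16[] = u"\U0001F600";
static const char32_t grin32[] = U"\U0001F600";

/* Hypothetical helper: count char16_t units before the terminating zero. */
size_t u16_units(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Hypothetical helper: count char32_t units before the terminating zero. */
size_t u32_units(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

Note that u16_units counts code units, not characters: the single character in grin16 is reported as 2.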

(Translating the cross-references: <stddef.h> defines size_t, <wchar.h> defines mbstate_t, and <stdint.h> defines uint_least16_t and uint_least32_t.)

The <uchar.h> header also defines a minimal set of (restartable) conversion functions: