c++ - Why can u8'A' be a char, when UTF-8 can take up to 4 bytes and char is normally 1 byte?

Reposted. Author: 可可西里. Updated: 2023-11-01 17:39:56

I was reading "What is the use of wchar_t in general programming?" and found something confusing in the accepted answer:

It's more common to use char with a variable-width encoding e.g. UTF-8 or GB 18030.

I found this in my textbook:

[image]

Isn't Unicode encoded as UTF-8 at most 4 bytes? char is 1 byte on most platforms. Am I misunderstanding something?


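Update:

After searching and reading, I now know:

  1. Code points and code units are different things. A code point is unique, whereas code units depend on the encoding.
  2. u8'a' (a character here, not a string) is only allowed for the basic character set (ASCII, including its control characters). Its value is the code unit value of 'a', and for ASCII characters the code unit has the same value as the code point. (This is from @codekaizer's answer.)
  3. std::string::size() returns the number of code units.

So editors all work with code units, right? And if I change the file encoding from UTF-8 to UTF-32, the size of ə would be 4 bytes?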
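
To make the difference between code units and code points concrete, here is a minimal C++ sketch (C++14 or later). The names count_code_points and schwa_utf8 are made up for illustration, and the helper assumes its input is valid UTF-8. The 4-byte figure for UTF-32 assumes a typical platform with 8-bit bytes, where char32_t occupies 4 bytes.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 byte sequence by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
constexpr std::size_t count_code_points(const char* s, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Convenience overload: s.size() is the number of code units (bytes).
inline std::size_t count_code_points(const std::string& s) {
    return count_code_points(s.data(), s.size());
}

// U+0259 (the schwa) written as its two UTF-8 code units, 0xC9 0x99,
// so the example does not depend on the source file's encoding.
constexpr char schwa_utf8[] = "\xC9\x99";
static_assert(sizeof(schwa_utf8) - 1 == 2, "two UTF-8 code units");
static_assert(count_code_points(schwa_utf8, 2) == 1, "one code point");

// The same code point as a single UTF-32 code unit.
static_assert(sizeof(U'\u0259') == 4, "char32_t is 4 bytes on this platform");
```

For a std::string holding UTF-8 text, size() reports the first number (code units), not the number of code points.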

Best answer

Isn't Unicode encoded with UTF-8 at most 4 bytes?

According to [lex.ccon]/3 (emphasis mine):

A character literal that begins with u8, such as u8'w', is a character literal of type char, known as a UTF-8 character literal. The value of a UTF-8 character literal is equal to its ISO 10646 code point value, provided that the code point value is representable with a single UTF-8 code unit (that is, provided it is in the C0 Controls and Basic Latin Unicode block). If the value is not representable with a single UTF-8 code unit, the program is ill-formed. A UTF-8 character literal containing multiple c-chars is ill-formed.

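A single UTF-8 code unit is 1 byte, so it fits in a char. A character that needs more than one code unit cannot be written as a u8 character literal at all.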
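
The rule quoted above can be checked at compile time. The following is only a sketch: utf8_code_units is a hypothetical helper, not a standard library function, and it does not reject surrogate code points. The u8 literal checks are guarded because u8 character literals only exist from C++17 on (with type char there, and char8_t from C++20).

```cpp
// Number of UTF-8 code units (bytes) needed to encode one code point.
// Returns 0 for values beyond U+10FFFF; surrogates are not rejected.
constexpr int utf8_code_units(char32_t cp) {
    return cp <= 0x7F     ? 1
         : cp <= 0x7FF    ? 2
         : cp <= 0xFFFF   ? 3
         : cp <= 0x10FFFF ? 4
                          : 0;
}

static_assert(utf8_code_units(U'A') == 1, "Basic Latin: one code unit");
static_assert(utf8_code_units(U'\u0259') == 2, "U+0259 needs two");
static_assert(utf8_code_units(U'\u20AC') == 3, "U+20AC needs three");
static_assert(utf8_code_units(U'\U0001F600') == 4, "U+1F600 needs four");

#if __cplusplus >= 201703L
// A u8 character literal holds exactly one code unit, hence one byte.
static_assert(sizeof(u8'A') == 1, "one code unit, one byte");
static_assert(u8'A' == 0x41, "value equals the code point U+0041");
// u8'\u0259' would be ill-formed: U+0259 needs two code units.
#endif
```

In other words, a u8 character literal is accepted only for code points where this helper would return 1, that is U+0000 through U+007F.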

Regarding "c++ - Why can u8'A' be a char, when UTF-8 can take up to 4 bytes and char is normally 1 byte?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50343179/
