
c++ - Looping over a Unicode string as characters


With the following string, the size is reported incorrectly. Why is that, and how can I fix it?

string str = " ██████";
cout << str.size();
// outputs 19 rather than 7

I'm trying to iterate over str character by character so I can read it into a vector<string>, whose size should be 7, but I can't do that because the code above outputs 19.

Best Answer

TL;DR: the size() and length() members of basic_string return the size in code units of the underlying string, not the number of visible characters. To get the expected number:

  • Use UTF-16 with the u prefix for very simple strings that contain no non-BMP characters, no combining characters and no joining characters
  • Use UTF-32 with the U prefix for very simple strings that don't contain any combining or joining characters
  • Normalize the string and count, for arbitrary Unicode strings
  • " ██████"是一个空格,后跟一系列 6 U+2588人物。您的编译器似乎正在使用 UTF-8std::string . UTF-8 是 variable-length encoding并且许多字母使用多个字节进行编码(因为很明显你不能只用一个字节编码超过 256 个字符)。在 UTF-8 中,U+0800 和 U+FFFF 之间的代码点由 3 个字节编码。因此UTF-8中字符串的长度是 1 + 6*3 = 19 字节。
You can check with any Unicode converter, such as this one, and see that the string is encoded as 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 in UTF-8. You can also iterate over each byte of the string to verify this, as in the sketch below.
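As a quick check (my own sketch, not part of the original answer; it assumes the source file is saved as UTF-8), you can dump each byte in hex and compare against the sequence above:

#include <iomanip>
#include <iostream>
#include <string>

int main() {
    std::string str = " ██████";   // assumes this source file is saved as UTF-8
    for (unsigned char byte : str)
        std::cout << std::hex << std::uppercase << std::setw(2)
                  << std::setfill('0') << static_cast<int>(byte) << ' ';
    // expected output: 20 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88 E2 96 88
    std::cout << '\n' << std::dec << str.size() << '\n';   // 19
}
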
If you want the total number of visible characters in the string, things get much trickier, and Churill's solution doesn't work. Read the example from Twitter:

    If you use anything beyond the most basic letters, numbers, and punctuation the situation gets more confusing. While many people use multi-byte Kanji characters to exemplify these issues, Twitter has found that accented vowels cause the most confusion because English speakers simply expect them to work. Take the following example: the word “café”. It turns out there are two byte sequences that look exactly the same, but use a different number of bytes:

    café  0x63 0x61 0x66 0xC3 0xA9        Using the “é” character, called the “composed character”.
    café 0x63 0x61 0x66 0x65 0xCC 0x81 Using the combining diacritical, which overlaps the “e”
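To reproduce the byte counts from the quote, here is a small standalone sketch (my addition; the two byte sequences are spelled out explicitly so the result doesn't depend on the editor's encoding):

#include <iostream>
#include <string>

int main() {
    // Same visible word, two different UTF-8 byte sequences (from the quote above)
    std::string composed   = "caf\xC3\xA9";       // precomposed "é", U+00E9
    std::string decomposed = "cafe\xCC\x81";      // 'e' + U+0301 COMBINING ACUTE ACCENT

    std::cout << composed.size() << '\n';           // 5 bytes
    std::cout << decomposed.size() << '\n';         // 6 bytes
    std::cout << (composed == decomposed) << '\n';  // 0: byte-wise they are not equal
}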

You need a Unicode library such as ICU to normalize the string and count it. Twitter, for example, uses Normalization Form C (NFC).
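The answer doesn't show code for this step; as a rough sketch, assuming ICU4C is installed (the Normalizer2 API lives in ICU's common library, typically linked with -licuuc), normalizing to NFC and counting code points could look like this:

#include <unicode/normalizer2.h>
#include <unicode/unistr.h>
#include <iostream>

int main() {
    UErrorCode status = U_ZERO_ERROR;
    const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);

    // Decomposed "café": 'e' followed by U+0301 COMBINING ACUTE ACCENT (5 code points)
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("cafe\xCC\x81");
    icu::UnicodeString normalized = nfc->normalize(s, status);

    if (U_SUCCESS(status))
        std::cout << normalized.countChar32() << '\n';   // 4, the count a reader expects
}
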
Edit:
Since you're only interested in box-drawing characters, which don't appear to be outside the BMP and don't involve any combining characters, UTF-16 and UTF-32 will both work. Like std::string, std::wstring is also a basic_string and has no mandated encoding. On most implementations it's UTF-16 (Windows) or UTF-32 (*nix), so you could use it, but that's unreliable and depends on the source encoding. A better way is to use std::u16string (std::basic_string<char16_t>) and std::u32string (std::basic_string<char32_t>), which work regardless of the system and the source file's encoding.
std::wstring wstr     = L" ██████";
std::u16string u16str = u" ██████";
std::u32string u32str = U" ██████";
std::cout << wstr.size();   // may work, returns the number of wchar_t characters
std::cout << u16str.size(); // always returns the number of UTF-16 code units
std::cout << u32str.size(); // always returns the number of UTF-32 code units
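
To get back to the original goal of reading the string into a vector<string> with one element per visible character, here is a minimal sketch of my own (not from the answer), assuming the text stays within the BMP and uses no combining characters: iterate the u32string and re-encode each code point as a small UTF-8 std::string.

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: encode a single code point as UTF-8
std::string to_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main() {
    std::u32string u32str = U" ██████";
    std::vector<std::string> chars;           // one UTF-8 string per code point
    for (char32_t cp : u32str)
        chars.push_back(to_utf8(cp));
    std::cout << chars.size() << '\n';        // prints 7
}
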
If you're interested in how to solve the problem for all Unicode characters, read on below:

    The “café” issue mentioned above raises the question of how you count the characters in the Tweet string “café”. To the human eye the length is clearly four characters. Depending on how the data is represented this could be either five or six UTF-8 bytes. Twitter does not want to penalize a user for the fact we use UTF-8 or for the fact that the API client in question used the longer representation. Therefore, Twitter does count “café” as four characters, no matter which representation is sent.

    [...]

    Twitter counts the length of a Tweet using the Normalization Form C (NFC) version of the text. This type of normalization favors the use of a fully combined character (0xC3 0xA9 from the café example) over the long-form version (0x65 0xCC 0x81). Twitter also counts the number of codepoints in the text rather than UTF-8 bytes. The 0xC3 0xA9 from the café example is one codepoint (U+00E9) that is encoded as two bytes in UTF-8, whereas 0x65 0xCC 0x81 is two codepoints encoded as three bytes

    Twitter - Counting characters


See also
  • When "Zoë" !== "Zoë". Or why you need to normalize Unicode strings
  • Getting Twitter characters count
  • Why is the length of this string longer than the number of characters in it?
  • On the topic of c++ - Looping over a Unicode string as characters, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/58465193/
