c++ - Decoding utf8 entities from JSON to utf8 in C++

Reposted. Author: 搜寻专家. Updated: 2023-10-31 02:13:48

I have a string containing utf8 entities (I am not sure I am naming them correctly):

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

How can I convert it into something more readable? I use g++ with C++11 support, but after digging through the std::codecvt manual for a few hours I got nowhere:

std::string std = "\u0418\u043d\u0434\u0435\u043a\u0441";

wstring_convert<codecvt_utf8_utf16<char16_t>,char16_t> convert;
string dest = convert.to_bytes(std);

This produces a nightmare of an error trace that starts with:

error: no matching function for call to ‘std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>::to_bytes(std::string&)

I hope there is another way.

Best answer

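First off, your use of std::wstring_convert is backwards. You have a UTF-8 encoded std::string that you want to convert to a wide Unicode string. You get the compiler error because to_bytes() does not take a std::string as input. It takes a std::wstring_convert::wide_string as input (which is std::u16string in your case, because you used char16_t in the specialization), so you need to use from_bytes() instead of to_bytes():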

#include <codecvt>
#include <locale>
#include <string>

std::string str = "\u0418\u043d\u0434\u0435\u043a\u0441";

std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
std::u16string dest = convert.from_bytes(str);

Now, that being said, section 9 of the JSON specification states:

9 String

A string is a sequence of Unicode code points wrapped with quotation marks (U+0022). All characters may be placed within the quotation marks except for the characters that must be escaped: quotation mark (U+0022), reverse solidus (U+005C), and the control characters U+0000 to U+001F. There are two-character escape sequence representations of some characters.

\" represents the quotation mark character (U+0022).

\\ represents the reverse solidus character (U+005C).

\/ represents the solidus character (U+002F).

\b represents the backspace character (U+0008).

\f represents the form feed character (U+000C).

\n represents the line feed character (U+000A).

\r represents the carriage return character (U+000D).

\t represents the character tabulation character (U+0009).

So, for example, a string containing only a single reverse solidus character may be represented as "\\".

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. Hexadecimal digits can be digits (U+0030 through U+0039) or the hexadecimal letters A through F in uppercase (U+0041 through U+0046) or lowercase (U+0061 through U+0066). So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

The following four cases all produce the same result:

"\u002F"

"\u002f"

"\/"

"/"

To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E".
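The surrogate-pair arithmetic behind the G clef example can be sketched as follows. The helper name combine_surrogates is hypothetical and is not part of the specification or of any library. It assumes the caller has already checked that the two code units are a valid high/low surrogate pair.

```cpp
#include <cstdint>

// Hypothetical helper (an illustration, not from the JSON specification):
// combine a UTF-16 surrogate pair into the code point it encodes.
// Precondition: hi is in [0xD800, 0xDBFF] and lo is in [0xDC00, 0xDFFF].
constexpr char32_t combine_surrogates(char16_t hi, char16_t lo) {
    return 0x10000
         + ((static_cast<char32_t>(hi) - 0xD800) << 10)
         + (static_cast<char32_t>(lo) - 0xDC00);
}

// The example from the specification: "\uD834\uDD1E" is U+1D11E.
static_assert(combine_surrogates(0xD834, 0xDD1E) == 0x1D11E, "G clef");
```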

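The raw JSON data itself may be encoded in UTF-8 (the most common encoding), UTF-16, and so on. But regardless of the encoding used, the character sequence "\u0418\u043d\u0434\u0435\u043a\u0441" represents the UTF-16 code unit sequence U+0418 U+043d U+0434 U+0435 U+043a U+0441, which is the Unicode string "Индекс".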

If you use an actual JSON parser (such as JSON for Modern C++, jsoncpp, RapidJSON, etc.), it will parse the UTF-16 code unit values for you and return a readable Unicode string.

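However, if you are processing the JSON data by hand, you have to decode any \x and \uXXXX escape sequences yourself (where \x stands for one of the two-character escapes listed above). std::wstring_convert cannot do that for you. It can only convert the JSON from std::string to std::wstring/std::u16string, if that makes the data easier for you to parse. You still have to parse the content of the JSON separately.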
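A minimal sketch of such manual decoding, assuming the input is the body of a single JSON string (the text between the quotation marks) and that the desired output is UTF-8. The names append_utf8, read_hex4 and decode_json_escapes are hypothetical helpers and do not come from any library. The sketch does not validate unescaped bytes, so it is not a substitute for a real JSON parser.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// All functions here are hypothetical helpers written for illustration.

// Append code point `cp` to `out` encoded as UTF-8.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Read the four hex digits starting at in[pos] as one UTF-16 code unit.
unsigned read_hex4(const std::string& in, std::size_t pos) {
    if (pos + 4 > in.size()) throw std::runtime_error("truncated \\u escape");
    unsigned value = 0;
    for (std::size_t i = pos; i < pos + 4; ++i) {
        const char c = in[i];
        value <<= 4;
        if (c >= '0' && c <= '9')      value |= c - '0';
        else if (c >= 'a' && c <= 'f') value |= c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') value |= c - 'A' + 10;
        else throw std::runtime_error("bad hex digit in \\u escape");
    }
    return value;
}

// Decode the escape sequences in a JSON string body and return UTF-8.
std::string decode_json_escapes(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '\\') { out += in[i]; continue; }
        if (++i == in.size()) throw std::runtime_error("dangling backslash");
        switch (in[i]) {
            case '"': case '\\': case '/': out += in[i]; break;
            case 'b': out += '\b'; break;
            case 'f': out += '\f'; break;
            case 'n': out += '\n'; break;
            case 'r': out += '\r'; break;
            case 't': out += '\t'; break;
            case 'u': {
                char32_t cp = read_hex4(in, i + 1);
                i += 4;  // i now points at the last hex digit
                if (cp >= 0xD800 && cp <= 0xDBFF) {
                    // High surrogate: a \uXXXX low surrogate must follow.
                    if (i + 2 >= in.size() || in[i + 1] != '\\' || in[i + 2] != 'u')
                        throw std::runtime_error("unpaired high surrogate");
                    const unsigned lo = read_hex4(in, i + 3);
                    if (lo < 0xDC00 || lo > 0xDFFF)
                        throw std::runtime_error("invalid low surrogate");
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                    i += 6;  // skip the second \uXXXX
                } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                    throw std::runtime_error("unpaired low surrogate");
                }
                append_utf8(out, cp);
                break;
            }
            default: throw std::runtime_error("unknown escape");
        }
    }
    return out;
}
```

Note that the input must contain real backslashes. In C++ source that means writing "\\u0418" or using a raw string literal, because the compiler itself translates "\u0418" inside an ordinary literal into the character it names.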

Afterwards, if needed, you can use std::wstring_convert to convert any extracted std::wstring/std::u16string strings back to UTF-8 to save memory.
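That conversion back to UTF-8 might look like the sketch below. The wrapper name utf16_to_utf8 is hypothetical; the conversion itself is done by std::wstring_convert::to_bytes(). Keep in mind that std::wstring_convert and the <codecvt> facets are deprecated since C++17, so newer compilers may emit deprecation warnings for this code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrapper: convert UTF-16 text to a UTF-8 encoded std::string.
// to_bytes() throws std::range_error if the input is not valid UTF-16.
std::string utf16_to_utf8(const std::u16string& text) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    return convert.to_bytes(text);
}
```

For the six Cyrillic code units of "Индекс", the result is a 12-byte std::string, since each of those characters takes two bytes in UTF-8.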

Regarding "c++ - Decoding utf8 entities from JSON to utf8 in C++", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/40793252/
