
Converting a C++ string to a valid UTF-8 string using utf8proc

Reposted. Author: 太空狗. Updated: 2023-10-29 23:18:44

I have a std::string called output. I would like to turn it into a valid UTF-8 string using utf8proc. http://www.public-software-group.org/utf8proc-documentation

typedef int int32_t;
#define ssize_t int
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options)
Reencodes the sequence of unicode characters given by the pointer buffer and length as UTF-8. The result is stored in the same memory area where the data is read. Following flags in the options field are regarded: (Documentation missing here) In case of success the length of the resulting UTF-8 string is returned, otherwise a negative error code is returned.
WARNING: The amount of free space being pointed to by buffer, has to exceed the amount of the input data by one byte, and the entries of the array pointed to by str have to be in the range of 0x0000 to 0x10FFFF, otherwise the program might crash!

So first, how do I add the extra byte at the end? And then, how do I convert from a std::string to an int32_t *buffer?

This does not work:

std::string g = output();
fprintf(stdout,"str: %s\n",g.c_str());
g += " "; //add an extra byte??
g = utf8proc_reencode((int*)g.c_str(), g.size()-1, 0);
fprintf(stdout,"strutf8: %s\n",g.c_str());

Best answer

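You most likely don't actually want utf8proc_reencode(). That function takes a buffer of valid UTF-32 code points and turns it into valid UTF-8, but since you say you don't know what encoding your data is in, you can't use it: casting the bytes of a std::string to int32_t * does not produce code points.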
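To make the distinction concrete, here is a minimal hand-written sketch of what "UTF-32 in, UTF-8 out" means. It does not use utf8proc at all; encode_utf8 is a hypothetical helper name chosen for illustration. The input is one code point per 32-bit element, which is why the raw bytes of a std::string are not a suitable input.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (NOT part of utf8proc): encodes a sequence of
// Unicode code points, one per 32-bit element, as a UTF-8 byte string.
std::string encode_utf8(const std::vector<std::uint32_t>& code_points) {
    std::string out;
    for (std::uint32_t cp : code_points) {
        // Surrogates and values above U+10FFFF are not encodable.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("not a Unicode scalar value");
        }
        if (cp < 0x80) {                      // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {            // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                              // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For example, the code points U+0048, U+00E9 and U+20AC go in as three 32-bit values and come out as six bytes. Nothing like this helps until you know which code points your original bytes are supposed to represent.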

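So, first you need to figure out what encoding your data is actually in. You can use http://utfcpp.sourceforge.net/ to test whether you already have valid UTF-8 with utf8::is_valid(g.begin(), g.end()). If that returns true, you're done!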
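If you would rather not pull in a library just for this check, the validation itself can be sketched by hand. The function below is an illustration only: is_valid_utf8 is a hypothetical name, it is not the utfcpp implementation, and it is assumed here that you want to reject overlong forms, surrogates and values above U+10FFFF.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not utfcpp): returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        if (b0 < 0x80) { ++i; continue; }     // plain ASCII byte

        std::size_t len = 0;                  // total bytes in this sequence
        unsigned long cp = 0;                 // decoded code point
        unsigned long min_cp = 0;             // smallest value allowed for len
        if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;                    // stray continuation or bad lead byte

        if (n - i < len) return false;        // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // must be 10xxxxxx
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        i += len;
    }
    return true;
}
```

Note that a true result only says the bytes are well-formed UTF-8. Pure ASCII text passes too, and a false result tells you nothing about which other encoding the data is in.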

If it returns false, things get complicated... but ICU ( http://icu-project.org/ ) can help you out; see http://userguide.icu-project.org/conversion/detection

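Once you know, with reasonable reliability, what encoding your data is in, ICU can help again with getting it to UTF-8. For example, assuming your source data g is in ISO-8859-1: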

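#include <string>
#include <vector>
#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR; // check U_FAILURE(err) after every call...
// CONVERT FROM ISO-8859-1 TO UChar (UTF-16 code units)
UConverter *conv_from = ucnv_open("ISO-8859-1", &err);
std::vector<UChar> converted(g.size() * 2); // *2 is usually more than enough
int32_t conv_len = ucnv_toUChars(conv_from, converted.data(), static_cast<int32_t>(converted.size()), g.data(), static_cast<int32_t>(g.size()), &err);
ucnv_close(conv_from);
if (U_FAILURE(err)) { /* handle the error; do not use conv_len */ }
converted.resize(conv_len);
// CONVERT FROM UChar TO UTF-8
g.resize(converted.size() * 4); // each UChar needs at most 3 UTF-8 bytes
UConverter *conv_u8 = ucnv_open("UTF-8", &err);
int32_t u8_len = ucnv_fromUChars(conv_u8, &g[0], static_cast<int32_t>(g.size()), converted.data(), static_cast<int32_t>(converted.size()), &err);
ucnv_close(conv_u8);
if (U_FAILURE(err)) { /* handle the error; do not use u8_len */ }
g.resize(u8_len);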
After that, g holds UTF-8 data.
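For the specific case of ISO-8859-1, the conversion is simple enough to sketch without ICU, because every ISO-8859-1 byte value is numerically equal to its Unicode code point (U+0000 to U+00FF). The helper below is a hypothetical illustration under that assumption only; it does not generalize to other encodings such as Windows-1252, where you would still want a real converter.

```cpp
#include <string>

// Hypothetical helper: converts ISO-8859-1 (Latin-1) bytes to UTF-8.
// Assumes the input really is ISO-8859-1; it cannot detect that.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // each input byte becomes at most 2 bytes
    for (char ch : in) {
        const unsigned char c = static_cast<unsigned char>(ch);
        if (c < 0x80) {
            out += ch;                                   // ASCII is unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));   // lead byte 110xxxxx
            out += static_cast<char>(0x80 | (c & 0x3F)); // continuation 10xxxxxx
        }
    }
    return out;
}
```

With this, g = latin1_to_utf8(g); has the same effect as the ICU round trip above for ISO-8859-1 input, for instance turning the single byte 0xE9 into the two bytes 0xC3 0xA9.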

This question and answer about converting a C++ string to a valid UTF-8 string using utf8proc come from Stack Overflow: https://stackoverflow.com/questions/13047927/
