
C Unicode: How do I apply the C11 standard amendment DR488 fix to the C11 standard function c16rtomb()?

Repost. Author: 太空宇宙. Updated: 2023-11-04 03:14:37

Question:

The C reference page for the function c16rtomb on CPPReference says the following under its Notes section:

In C11 as published, unlike mbrtoc16, which converts variable-width multibyte (such as UTF-8) to variable-width 16-bit (such as UTF-16) encoding, this function can only convert single-unit 16-bit encoding, meaning it cannot convert UTF-16 to UTF-8 despite that being the original intent of this function. This was corrected by the post-C11 defect report DR488.

Below this passage, the C reference page provides example source code, preceded by the following sentence:

Note: this example assumes the fix for the defect report 488 is applied.

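This sentence implies that there is a way to take DR488 and somehow "apply" the fix to the C11 standard function c16rtomb.

I would like to know how to apply the fix for GCC, because it appears to me that the fix has already been applied in Visual Studio 2017 Visual C++, as of v141.

The behavior seen with GCC, when debugging the code in GDB, is consistent with what DR488 describes, as follows: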

Section 7.28.1 describes the function c16rtomb(). In particular, it states "When c16 is not a valid wide character, an encoding error occurs". "wide character" is defined in section 3.7.3 as "value representable by an object of type wchar_t, capable of representing any character in the current locale". This wording seems to imply that, e.g. for the common cases (e.g, an implementation that defines __STDC_UTF_16__ and a program that uses an UTF-8 locale), c16rtomb() will return -1 when it encounters a character that is encoded as multiple char16_t (for UTF-16 a wide character can be encoded as a surrogate pair consisting of two char16_t). In particular, c16rtomb() will not be able to process strings generated by mbrtoc16().

The text in bold is the behavior described.

Source code:

#include <stdio.h>
#include <uchar.h>

#define __STD_UTF_16__

int main() {
    char16_t* ptr_string = (char16_t*) u"我是誰";

    //C++ disallows variable-length arrays.
    //GCC uses GNUC++, which has a C++ extension for variable length arrays.
    //It is not a truly standard feature in C++ pedantic mode at all.
    //https://stackoverflow.com/questions/40633344/variable-length-arrays-in-c14
    char buffer[64];
    char* bufferOut = buffer;

    //Must zero this object before attempting to use mbstate_t at all.
    mbstate_t multiByteState = {};

    //c16 = 16-bit Characters or char16_t typed characters
    //r = representation
    //tomb = to Multi-Byte Strings
    while (*ptr_string) {
        char16_t character = *ptr_string;
        size_t size = c16rtomb(bufferOut, character, &multiByteState);
        if (size == (size_t) -1)
            break;
        bufferOut += size;
        ptr_string++;
    }

    size_t bufferOutSize = bufferOut - buffer;
    printf("Size: %zu - ", bufferOutSize);
    for (int i = 0; i < bufferOutSize; i++) {
        printf("%#x ", +(unsigned char) buffer[i]);
    }

    //This statement is used to set a breakpoint. It does not do anything else.
    int debug = 0;
    return 0;
}

Output from Visual Studio:

Size: 9 - 0xe6 0x88 0x91 0xe6 0x98 0xaf 0xe8 0xaa 0xb0

Output from GCC:

Size: 0 -

Best answer

On Linux, you should be able to fix this by calling setlocale(LC_ALL, "en_US.utf8"); before the conversion.

Example on ideone

The function does the following, as described in the Microsoft documentation:
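The suggestion above can be sketched as a pair of small helpers. This is a minimal illustration rather than the answer's own code: the function names use_utf8_locale and utf16_to_multibyte are hypothetical, and the list of locale names is an assumption, since which UTF-8 locales are installed varies from system to system.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
   Returns 1 if one of them could be selected, 0 otherwise.
   The candidate names are assumptions about the target system. */
int use_utf8_locale(void) {
    static const char *const candidates[] = {
        "en_US.utf8", "en_US.UTF-8", "C.UTF-8"
    };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_ALL, candidates[i]) != NULL)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: convert a NUL-terminated char16_t string to the
   multibyte encoding of the current locale, one code unit at a time.
   Returns the number of bytes written (not counting the terminating NUL),
   or (size_t)-1 on an encoding error or if dst is too small. */
size_t utf16_to_multibyte(const char16_t *src, char *dst, size_t dst_size) {
    assert(src != NULL && dst != NULL);

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* zeroed state = initial conversion state */

    char tmp[MB_LEN_MAX];              /* large enough for any single character */
    size_t used = 0;

    for (; *src != 0; src++) {
        size_t n = c16rtomb(tmp, *src, &state);
        if (n == (size_t)-1)
            return (size_t)-1;         /* not representable in this locale */
        if (used + n >= dst_size)
            return (size_t)-1;         /* keep one byte free for the NUL */
        memcpy(dst + used, tmp, n);
        used += n;
    }
    dst[used] = '\0';
    return used;
}
```

Calling use_utf8_locale() once at program start and then utf16_to_multibyte(u"我是誰", buffer, sizeof buffer) should, in a UTF-8 locale, produce the same 9 bytes as the Visual Studio output in the question. Without the locale call the program stays in the default "C" locale, which is the likely reason the question's GCC build stopped at the first character.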

Convert a UTF-16 wide character into a multibyte character in the current locale.

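The POSIX documentation is similar. __STD_UTF_16__ appears to have no effect in either compiler. It is supposed to specify the encoding of the source, which should be UTF-16. It does not specify the encoding of the destination.

The Windows documentation seems less consistent, because it appears to imply either that setlocale is required or that converting to the ANSI code page is an option.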
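On the macro: the name the C11 standard uses is __STDC_UTF_16__ (with a C after STD), and it is predefined by the implementation to announce that char16_t values are UTF-16. A program tests it rather than defines it, which would explain why defining the similarly named macro from the question changes nothing. A minimal sketch of such a test follows; the function name char16_is_utf16 is hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <uchar.h>

/* char16_t is required to hold at least 16 bits. */
static_assert(sizeof(char16_t) * CHAR_BIT >= 16, "char16_t must hold 16 bits");

/* Hypothetical helper: report whether the implementation declares,
   through the predefined macro __STDC_UTF_16__, that char16_t values
   are encoded as UTF-16. Returns 1 if so, 0 otherwise. */
int char16_is_utf16(void) {
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}
```

The macro only describes the char16_t side of the conversion. The multibyte side is still chosen by the current locale.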
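To see whether a particular C library implements the DR488 behavior quoted in the question, one option is to feed c16rtomb() a surrogate pair and observe the result. The sketch below is an assumption-laden probe, not an official test: the function name probe_dr488 is hypothetical, the locale names may not exist on every system, and the result depends entirely on the C library in use.

```c
#include <assert.h>
#include <limits.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical probe: convert U+1F600, which UTF-16 encodes as the
   surrogate pair 0xD83D 0xDE00.
   out must have room for at least MB_LEN_MAX bytes.
   Returns  1 if the pair was accepted (bytes in out, length in *out_len),
            0 if the library rejected the pair,
           -1 if no UTF-8 locale could be selected (names are assumptions). */
int probe_dr488(char *out, size_t *out_len) {
    assert(out != NULL && out_len != NULL);

    if (setlocale(LC_ALL, "en_US.utf8") == NULL &&
        setlocale(LC_ALL, "C.UTF-8") == NULL)
        return -1;

    mbstate_t state;
    memset(&state, 0, sizeof state);   /* initial conversion state */

    char scratch[MB_LEN_MAX];

    /* With the DR488 wording, the first half of a pair is only recorded
       in the state and nothing is written, so 0 is returned. */
    size_t first = c16rtomb(scratch, 0xD83D, &state);
    if (first != 0)
        return 0;

    /* The second half completes the character and emits its bytes. */
    size_t second = c16rtomb(out, 0xDE00, &state);
    if (second == (size_t)-1)
        return 0;

    *out_len = second;
    return 1;
}
```

A return value of 1 suggests the library follows the corrected wording, in which case no separate "fix" needs to be applied by the program. A return value of 0 suggests the as-published C11 behavior, where only characters that fit in a single char16_t can be converted.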

Regarding "C Unicode: How do I apply the C11 standard amendment DR488 fix to the C11 standard function c16rtomb()?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53148386/
