c++ - Converting from UTF-8 to ISO8859-15 in C++

Reposted. Author: 行者123. Updated: 2023-11-30 03:19:57
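I want to convert from UTF-8 to ISO 8859-15 in C/C++ without pulling in any additional libraries.

How can I do this?

I found the following code, which works for ISO 8859-1, but I am not sure how to handle the differences between ISO 8859-15 and ISO 8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-15):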

    #include <string>

    std::string UTF8toISO8859_1(const char * in) {
        std::string out;
        if (in == NULL)
            return out;

        unsigned int codepoint = 0;
        while (*in != 0) {
            unsigned char ch = static_cast<unsigned char>(*in);
            if (ch <= 0x7f)
                codepoint = ch;
            else if (ch <= 0xbf)
                codepoint = (codepoint << 6) | (ch & 0x3f);
            else if (ch <= 0xdf)
                codepoint = ch & 0x1f;
            else if (ch <= 0xef)
                codepoint = ch & 0x0f;
            else
                codepoint = ch & 0x07;
            ++in;
            if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
                if (codepoint <= 255) {
                    out.append(1, static_cast<char>(codepoint));
                }
                else {
                    out.append("?");
                }
            }
        }
        return out;
    }

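Best answer

I like this code. It is surprisingly short. Most of it just deals with decoding multi-byte sequences into code points. Once a code point has been decoded, the conversion to ISO-8859-1 is trivial: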

  • If it is less than or equal to 255, it is also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));
  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");
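To make the two steps concrete, here is a minimal sketch that separates the decoding from the ISO-8859-1 rule. The names firstCodepoint, toLatin1 and decodeDemo are hypothetical and do not appear in the original code. The bit masks are the ones used in the loop above, and, like that loop, the sketch assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>

// Hypothetical helper: decodes the first UTF-8 sequence of a NUL-terminated
// string, using the same masks as the conversion loop. No validation is done.
unsigned int firstCodepoint(const char* in) {
    unsigned int codepoint = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;                              // single ASCII byte
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);  // continuation byte: append 6 bits
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;                       // lead byte of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;                       // lead byte of a 3-byte sequence
        else
            codepoint = ch & 0x07;                       // lead byte of a 4-byte sequence
        ++in;
        if ((*in & 0xc0) != 0x80)
            break;                                       // next byte starts a new sequence
    }
    return codepoint;
}

// Hypothetical helper: the ISO-8859-1 rule from the two bullets above.
char toLatin1(unsigned int codepoint) {
    return codepoint <= 255 ? static_cast<char>(codepoint) : '?';
}

// Small check of both helpers on known byte sequences.
void decodeDemo() {
    assert(firstCodepoint("\xC3\xA9") == 0xE9);        // U+00E9, two bytes
    assert(firstCodepoint("\xE2\x82\xAC") == 0x20AC);  // U+20AC, three bytes
    assert(toLatin1(0xE9) == static_cast<char>(0xE9)); // representable in ISO-8859-1
    assert(toLatin1(0x20AC) == '?');                   // not representable
}
```

The euro sign shows why ISO-8859-1 is not enough here: it decodes to U+20AC, which is above 255 and therefore becomes a question mark.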

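So, to make it work for ISO-8859-15, more code is needed to handle the eight characters that were replaced when ISO-8859-15 was introduced (see Comparing ISO-8859-1 and ISO-8859-15). Unfortunately, this increases the code size considerably.

The code below should be easy to understand. If performance is a major concern, it can be optimized further.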

    #include <string>

    std::string UTF8toISO8859_15(const char * in) {
        std::string out;
        if (in == NULL)
            return out;

        unsigned int codepoint = 0;
        while (*in != 0) {
            // accumulate the bits of the current UTF-8 sequence into codepoint
            unsigned char ch = static_cast<unsigned char>(*in);
            if (ch <= 0x7f)
                codepoint = ch;
            else if (ch <= 0xbf)
                codepoint = (codepoint << 6) | (ch & 0x3f);
            else if (ch <= 0xdf)
                codepoint = ch & 0x1f;
            else if (ch <= 0xef)
                codepoint = ch & 0x0f;
            else
                codepoint = ch & 0x07;
            ++in;

            if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
                // a complete code point has been decoded; convert it to ISO-8859-15
                char outc;
                if (codepoint <= 255) {
                    // code points up to 255 can be converted directly, with a few exceptions
                    if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                        && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                        && codepoint != 0xbd && codepoint != 0xbe) {
                        outc = static_cast<char>(codepoint);
                    }
                    else {
                        // these ISO-8859-1 characters were dropped in ISO-8859-15
                        outc = '?';
                    }
                }
                else {
                    // with a few exceptions, code points above 255 cannot be converted
                    if (codepoint == 0x20AC) {
                        outc = static_cast<char>(0xa4);
                    }
                    else if (codepoint == 0x0160) {
                        outc = static_cast<char>(0xa6);
                    }
                    else if (codepoint == 0x0161) {
                        outc = static_cast<char>(0xa8);
                    }
                    else if (codepoint == 0x017d) {
                        outc = static_cast<char>(0xb4);
                    }
                    else if (codepoint == 0x017e) {
                        outc = static_cast<char>(0xb8);
                    }
                    else if (codepoint == 0x0152) {
                        outc = static_cast<char>(0xbc);
                    }
                    else if (codepoint == 0x0153) {
                        outc = static_cast<char>(0xbd);
                    }
                    else if (codepoint == 0x0178) {
                        outc = static_cast<char>(0xbe);
                    }
                    else {
                        outc = '?';
                    }
                }
                out.append(1, outc);
            }
        }
        return out;
    }
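As one possible direction for the optimization mentioned above, the per-character mapping could be moved into a switch, which a compiler can usually turn into a jump table or a short comparison tree. The following is only a sketch under that assumption, not a measured improvement. The names latin9FromCodepoint, utf8ToLatin9 and latin9Demo are hypothetical, and the decoding loop is unchanged, so it still assumes well-formed UTF-8.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps one code point to its ISO-8859-15 byte,
// or to '?' if ISO-8859-15 has no such character.
static char latin9FromCodepoint(unsigned int cp) {
    switch (cp) {
        // the eight characters that ISO-8859-15 introduced
        case 0x20AC: return static_cast<char>(0xa4);  // euro sign
        case 0x0160: return static_cast<char>(0xa6);  // S with caron
        case 0x0161: return static_cast<char>(0xa8);  // s with caron
        case 0x017D: return static_cast<char>(0xb4);  // Z with caron
        case 0x017E: return static_cast<char>(0xb8);  // z with caron
        case 0x0152: return static_cast<char>(0xbc);  // ligature OE
        case 0x0153: return static_cast<char>(0xbd);  // ligature oe
        case 0x0178: return static_cast<char>(0xbe);  // Y with diaeresis
        // the eight ISO-8859-1 characters they displaced
        case 0xa4: case 0xa6: case 0xa8: case 0xb4:
        case 0xb8: case 0xbc: case 0xbd: case 0xbe:
            return '?';
        default:
            return cp <= 0xff ? static_cast<char>(cp) : '?';
    }
}

// Same decoding loop as above, with the mapping delegated to the helper.
std::string utf8ToLatin9(const char* in) {
    std::string out;
    if (in == nullptr)
        return out;

    unsigned int cp = 0;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            cp = ch;
        else if (ch <= 0xbf)
            cp = (cp << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            cp = ch & 0x1f;
        else if (ch <= 0xef)
            cp = ch & 0x0f;
        else
            cp = ch & 0x07;
        ++in;
        if ((*in & 0xc0) != 0x80 && cp <= 0x10ffff)
            out.append(1, latin9FromCodepoint(cp));
    }
    return out;
}

// Usage check: the euro sign followed by '5' (UTF-8 bytes E2 82 AC 35).
void latin9Demo() {
    assert(utf8ToLatin9("\xE2\x82\xAC" "5") == "\xA4" "5");
}
```

The string literals in the check are split on purpose: written as one literal, "\xAC5" would be read as a single, out-of-range hexadecimal escape.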

Regarding c++ - Converting from UTF-8 to ISO8859-15 in C++, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53269432/
