
c++ - How do I read UTF-8 characters via a pointer?

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 23:09:12

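Suppose I have UTF-8 content stored in memory. How do I read its characters using a pointer? I assume I need to watch the 8th bit, which signals a multi-byte character, but how exactly do I turn the sequence into a valid Unicode character? Also, is wchar_t the right type to store a single Unicode character?

Here is what I have in mind: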


wchar_t readNextChar (char*& p)
{
    wchar_t unicodeChar;
    char ch = *p++;

    if ((ch & 128) != 0)
    {
        // This is a multi-byte character, what do I do now?
        // char chNext = *p++;
        // ... but how do I assemble the Unicode character?
        ...
    }
    ...
    return unicodeChar;
}

Best answer

You have to decode the UTF-8 bit pattern into its unencoded UTF-32 representation. If you want the actual Unicode code point, you must use a 32-bit data type.
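To make the bit manipulation concrete, here is a minimal sketch that assembles one three-byte sequence by hand. The name `decodeThreeBytes` is hypothetical, and the sketch assumes the caller already knows the bytes form a well-formed sequence, so it does no validation at all.

```cpp
#include <cstdint>

// Hypothetical helper: combines the payload bits of a well-formed
// three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point.
// No validation is performed; the caller must guarantee the byte pattern.
std::uint32_t decodeThreeBytes(unsigned char b0, unsigned char b1, unsigned char b2)
{
    std::uint32_t cp = b0 & 0x0Fu;      // low 4 bits of the lead byte
    cp = (cp << 6) | (b1 & 0x3Fu);      // 6 payload bits of the first continuation byte
    cp = (cp << 6) | (b2 & 0x3Fu);      // 6 payload bits of the second continuation byte
    return cp;
}
```

For example, the bytes E2 82 AC (the euro sign) come out as the code point U+20AC.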

On Windows, wchar_t is not large enough, because it is only 16 bits. You have to use unsigned int or unsigned long instead. Use wchar_t only when you are dealing with UTF-16 code units.
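To see why 16 bits are not enough for a code point, here is a rough sketch of how a code point maps to UTF-16 code units. The helper name `toUtf16` is hypothetical, and it assumes its argument is a valid Unicode scalar value (at most 0x10FFFF and not a surrogate).

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::u16string toUtf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: 20 bits split into two 10-bit halves.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Anything above U+FFFF, such as U+1F600, takes two code units (0xD83D, 0xDE00), so a single 16-bit wchar_t cannot hold it.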

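On other platforms, wchar_t is usually 32 bits. When writing portable code, though, you should stay away from wchar_t unless it is absolutely needed (for example, for std::wstring).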
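One portable alternative, assuming a C++11 or later compiler, is char32_t together with std::u32string. The answer itself does not use these types, so treat the following as a sketch of the idea; the function name `sampleCodePoints` is made up for illustration.

```cpp
#include <climits>
#include <string>

// char32_t is at least 32 bits wide on every conforming implementation,
// unlike wchar_t, whose width varies by platform.
static_assert(sizeof(char32_t) * CHAR_BIT >= 32,
              "char32_t must be able to hold any code point");

// Hypothetical sample: three code points stored one per element,
// including one outside the Basic Multilingual Plane.
std::u32string sampleCodePoints()
{
    return U"a\u20AC\U0001F600";
}
```

Each element of the returned string is a whole code point, so the string has three elements regardless of platform.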

Try something more like this:

#define IS_IN_RANGE(c, f, l)    (((c) >= (f)) && ((c) <= (l)))

// u_char and u_long are not standard C++ types. Platform headers such as
// <sys/types.h> usually provide them; these typedefs keep the snippet
// self-contained where they are missing.
typedef unsigned char u_char;
typedef unsigned long u_long;

// Decodes one code point starting at p and advances p past it.
// Returns (u_long) -1 on malformed input, leaving p unchanged.
u_long readNextChar (char* &p)
{
    // TODO: since UTF-8 is a variable-length
    // encoding, you should pass in the input
    // buffer's actual byte length so that you
    // can determine if a malformed UTF-8
    // sequence would exceed the end of the buffer...

    u_char c1, c2, *ptr = (u_char*) p;
    u_long uc = 0;
    int seqlen;
    // int datalen = ... available length of p ...;

    /*
    if( datalen < 1 )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    c1 = ptr[0];

    // The lead byte determines the sequence length and
    // contributes the high-order bits of the code point.
    if( (c1 & 0x80) == 0 )
    {
        uc = (u_long) (c1 & 0x7F);
        seqlen = 1;
    }
    else if( (c1 & 0xE0) == 0xC0 )
    {
        uc = (u_long) (c1 & 0x1F);
        seqlen = 2;
    }
    else if( (c1 & 0xF0) == 0xE0 )
    {
        uc = (u_long) (c1 & 0x0F);
        seqlen = 3;
    }
    else if( (c1 & 0xF8) == 0xF0 )
    {
        uc = (u_long) (c1 & 0x07);
        seqlen = 4;
    }
    else
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }

    /*
    if( seqlen > datalen )
    {
        // malformed data, do something !!!
        return (u_long) -1;
    }
    */

    // Every trailing byte must have the form 10xxxxxx.
    for(int i = 1; i < seqlen; ++i)
    {
        c1 = ptr[i];

        if( (c1 & 0xC0) != 0x80 )
        {
            // malformed data, do something !!!
            return (u_long) -1;
        }
    }

    // Reject overlong encodings, UTF-16 surrogates,
    // and values above U+10FFFF.
    switch( seqlen )
    {
        case 2:
        {
            c1 = ptr[0];

            if( !IS_IN_RANGE(c1, 0xC2, 0xDF) )
            {
                // malformed data, do something !!!
                return (u_long) -1;
            }

            break;
        }

        case 3:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            switch (c1)
            {
                case 0xE0:
                    if (!IS_IN_RANGE(c2, 0xA0, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                case 0xED:
                    if (!IS_IN_RANGE(c2, 0x80, 0x9F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                default:
                    if (!IS_IN_RANGE(c1, 0xE1, 0xEC) && !IS_IN_RANGE(c1, 0xEE, 0xEF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
            }

            break;
        }

        case 4:
        {
            c1 = ptr[0];
            c2 = ptr[1];

            switch (c1)
            {
                case 0xF0:
                    if (!IS_IN_RANGE(c2, 0x90, 0xBF))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                case 0xF4:
                    if (!IS_IN_RANGE(c2, 0x80, 0x8F))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;

                default:
                    if (!IS_IN_RANGE(c1, 0xF1, 0xF3))
                    {
                        // malformed data, do something !!!
                        return (u_long) -1;
                    }
                    break;
            }

            break;
        }
    }

    // Append the 6 payload bits of each trailing byte.
    for(int i = 1; i < seqlen; ++i)
    {
        uc = ((uc << 6) | (u_long)(ptr[i] & 0x3F));
    }

    p += seqlen;
    return uc;
}
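
The TODO about passing in the buffer length could be handled along the following lines. This is only a sketch, not the answer's code: `readNextCharChecked`, `decodeAll`, and the `kInvalid` sentinel are hypothetical names, and instead of the byte-range tables it validates the decoded value (overlong, surrogate, above U+10FFFF), which should reject the same sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sentinel for malformed or truncated input (an assumption of this sketch).
const std::uint32_t kInvalid = 0xFFFFFFFFu;

// Hypothetical length-aware decoder: reads one code point from p[0..len)
// and advances p / shrinks len on success. On error it returns kInvalid
// and leaves p and len untouched.
std::uint32_t readNextCharChecked(const unsigned char*& p, std::size_t& len)
{
    if (len < 1) return kInvalid;

    const unsigned char c = p[0];
    std::uint32_t cp;      // code point being assembled
    std::uint32_t minCp;   // smallest value this sequence length may encode
    std::size_t seqlen;

    if      ((c & 0x80) == 0x00) { cp = c;        seqlen = 1; minCp = 0x0;     }
    else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; seqlen = 2; minCp = 0x80;    }
    else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; seqlen = 3; minCp = 0x800;   }
    else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; seqlen = 4; minCp = 0x10000; }
    else return kInvalid;  // stray continuation byte or invalid lead byte

    if (seqlen > len) return kInvalid;  // sequence runs past the end of the buffer

    for (std::size_t i = 1; i < seqlen; ++i) {
        if ((p[i] & 0xC0) != 0x80) return kInvalid;  // not a 10xxxxxx byte
        cp = (cp << 6) | (p[i] & 0x3Fu);
    }

    // Reject overlong forms, values above U+10FFFF, and UTF-16 surrogates.
    if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return kInvalid;

    p += seqlen;
    len -= seqlen;
    return cp;
}

// Hypothetical driver: decodes a whole buffer, stopping at the first error.
std::vector<std::uint32_t> decodeAll(const char* data, std::size_t len)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(data);
    std::vector<std::uint32_t> out;
    while (len > 0) {
        const std::uint32_t cp = readNextCharChecked(p, len);
        if (cp == kInvalid) break;
        out.push_back(cp);
    }
    return out;
}
```

Stopping at the first bad sequence is just one policy; a caller could equally skip a byte and emit U+FFFD, depending on what the application needs.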

Regarding "c++ - How do I read UTF-8 characters via a pointer?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/2948308/
