c++ - Convert QString into QByteArray with either UTF-8 or Latin1 encoding

Reposted. Author: 塔克拉玛干. Updated: 2023-11-02 23:25:45

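I want to convert a QString into either a UTF-8 or a Latin1 QByteArray, but at the moment everything comes out as UTF-8.

I am testing this with characters from the upper half of Latin1, above 0x7f. The German ü is a good example.

If I do this: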

QString name("\u00fc"); // U+00FC = ü
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("ISO 8859-1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

I get the following output.

utf8 "ü" "c3bc" 
Latin1 "ü" "c3bc"
ISO 8859-1 "ü" "c3bc"

As you can see, I get the UTF-8 bytes 0xc3 0xbc everywhere, whereas I would expect the Latin1 byte 0xfc in steps 2 and 3.

I would expect to get something like this:

utf8 "ü" "c3bc" 
Latin1 "ü" "fc"
ISO 8859-1 "ü" "fc"
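
The expected bytes above follow from how the two encodings represent the code point U+00FC. The following is a minimal sketch in plain C++ with no Qt involved. The helper names toHex, encodeLatin1 and encodeUtf8 are made up for this illustration and are not part of any library.

```cpp
#include <string>

// Render each byte as two lowercase hex digits.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}

// Latin-1 maps U+0000..U+00FF to the single byte with the same value.
// Code points outside that range become '?' in this sketch.
std::string encodeLatin1(char32_t cp) {
    return std::string(1, cp <= 0xff ? static_cast<char>(cp) : '?');
}

// UTF-8 uses one to four bytes depending on the size of the code point.
// Assumption: cp is a valid code point (no surrogate or range checks here).
std::string encodeUtf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xc0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xe0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    } else {
        out += static_cast<char>(0xf0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3f));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3f));
        out += static_cast<char>(0x80 | (cp & 0x3f));
    }
    return out;
}
```

With these helpers, toHex(encodeLatin1(0x00fc)) gives "fc" and toHex(encodeUtf8(0x00fc)) gives "c3bc", which are the values expected for steps 1 to 3.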

What is going on here?

/Thanks


This code was built and executed on an Ubuntu 10.04 based system.

$> uname -a
Linux frog 2.6.32-28-generic-pae #55-Ubuntu SMP Mon Jan 10 22:34:08 UTC 2011 i686 GNU/Linux
$> env | grep LANG
LANG=en_US.utf8

If I try using

utf8.append(name.toUtf8());

I get this output

utf8 "ü" "c383c2bc" 
Latin1 "ü" "c3bc"
ISO 8859-1 "ü" "c3bc"

So the Latin1 result is actually UTF-8, and the UTF-8 result is double-encoded...

This must depend on some system setting, right?


If I run this (I could not get it to build with .name() on codecForCStrings()):

qDebug() << "system name:"      << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings();
qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();

Then I get this:

system name: "en_US" 
codecForCStrings: 0x0
codecForLocale: "System"

Solution

If I specify that what I am using is UTF-8, the various classes know about it, and then it works.

QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

qDebug() << "system name:" << QLocale::system().name();
qDebug() << "codecForCStrings:" << QTextCodec::codecForCStrings()->name();
qDebug() << "codecForLocale:" << QTextCodec::codecForLocale()->name();

QString name("\u00fc");
QByteArray utf8;
utf8.append(name);
qDebug() << "utf8" << name << utf8.toHex();

QByteArray latin1;
latin1.append(name.toLatin1());
qDebug() << "Latin1" << name << latin1.toHex();

QTextCodec *codec = QTextCodec::codecForName("latin1");
QByteArray encodedString = codec->fromUnicode(name);
qDebug() << "ISO 8859-1" << name << encodedString.toHex();

Then I get this output:

system name: "en_US" 
codecForCStrings: "UTF-8"
codecForLocale: "UTF-8"
utf8 "ü" "c3bc"
Latin1 "ü" "fc"
ISO 8859-1 "ü" "fc"

Which looks the way it should.

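Best answer

Things to know:

  • Execution character set

The C++ standard has a term called the execution character set, which describes how string and character literals end up encoded in the binary produced by the compiler. You can read about it in subsection 1.1 Character sets of section 1 Overview in The C Preprocessor manual on the http://gcc.gnu.org site.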

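Question:
What does the "\u00fc" string literal produce?

Answer:
It depends on what the execution character set is. For gcc (which is what you are using) it defaults to UTF-8 unless you specify something different with the -fexec-charset option. You can read about this and other options controlling the preprocessing phase in subsection 3.11 Options Controlling the Preprocessor of section 3 GCC Command Options in the GCC manual on the http://gcc.gnu.org site. Now that we know the execution character set is UTF-8, we know that "\u00fc" is translated to the UTF-8 encoding of the Unicode code point U+00FC, which is the two-byte sequence 0xc3 0xbc.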
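
One way to check what the compiler actually stored for the literal is to dump its bytes, with no Qt involved. This is a small sketch that assumes a compiler whose execution character set is UTF-8 (the gcc default). The helper names toHex and literalBytes are made up for this illustration.

```cpp
#include <string>

// Render each byte as two lowercase hex digits.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}

// The bytes stored for "\u00fc" are chosen by the compiler according to the
// execution character set. Assumption: that set is UTF-8, so the literal
// holds the two bytes 0xc3 0xbc followed by the terminating NUL.
std::string literalBytes() {
    return toHex("\u00fc");
}

// Spelling the bytes out with hex escapes does not depend on the
// execution character set at all.
std::string explicitBytes() {
    return toHex("\xc3\xbc");
}
```

Under the stated assumption both functions return "c3bc". Building with a different -fexec-charset value would be expected to change only the result of literalBytes().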

In Qt 4, the QString constructor taking a char * calls QString QString::fromAscii ( const char * str, int size = -1 ), which uses the codec set with void QTextCodec::setCodecForCStrings ( QTextCodec * codec ) if a codec has been set, or otherwise does the same as QString QString::fromLatin1 ( const char * str, int size = -1 ).

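Question:
Which codec will the QString constructor use to decode the two-byte sequence (0xc3 0xbc) it receives?

Answer:
By default no codec is set with QTextCodec::setCodecForCStrings(), so Latin1 is used to decode the byte sequence. Since 0xc3 and 0xbc are both valid in Latin1, representing Ã and ¼ respectively (this should already be familiar to you, as it is taken directly from this answer to your earlier question), we get a QString containing these two characters.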
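
The effect of decoding with the wrong codec can be reproduced without Qt. The sketch below decodes the same two bytes once as Latin1 and once as UTF-8, then re-encodes the result. All helper names are made up for this illustration, and the UTF-8 routines are deliberately simplified: they only cover one- and two-byte sequences and do not reject overlong forms.

```cpp
#include <cstddef>
#include <string>

// Render each byte as two lowercase hex digits.
std::string toHex(const std::string &bytes) {
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (unsigned char b : bytes) {
        out += digits[b >> 4];
        out += digits[b & 0x0f];
    }
    return out;
}

// Latin1 decoding: every byte becomes the code point with the same value.
std::u32string decodeLatin1(const std::string &bytes) {
    std::u32string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char32_t>(b));
    }
    return out;
}

// Simplified UTF-8 decoding: one- and two-byte sequences only,
// everything else becomes U+FFFD.
std::u32string decodeUtf8Short(const std::string &bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const unsigned char next = i + 1 < bytes.size()
            ? static_cast<unsigned char>(bytes[i + 1]) : 0;
        if (lead < 0x80) {
            out.push_back(lead);
        } else if ((lead & 0xe0) == 0xc0 && (next & 0xc0) == 0x80) {
            out.push_back(static_cast<char32_t>(((lead & 0x1f) << 6) | (next & 0x3f)));
            ++i;  // the continuation byte has been consumed
        } else {
            out.push_back(0xfffd);
        }
    }
    return out;
}

// Latin1 encoding: code points up to U+00FF map to one byte, others to '?'.
std::string encodeLatin1(const std::u32string &text) {
    std::string out;
    for (char32_t cp : text) {
        out.push_back(cp <= 0xff ? static_cast<char>(cp) : '?');
    }
    return out;
}

// Simplified UTF-8 encoding: code points below U+0800 only, others to '?'.
std::string encodeUtf8Short(const std::u32string &text) {
    std::string out;
    for (char32_t cp : text) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xc0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3f)));
        } else {
            out.push_back('?');
        }
    }
    return out;
}
```

Decoding "\xc3\xbc" with decodeLatin1 yields the two code points U+00C3 and U+00BC. Re-encoding those with encodeLatin1 gives back "c3bc", and with encodeUtf8Short gives "c383c2bc", which matches the bytes reported in the question. Decoding with decodeUtf8Short instead yields the single code point U+00FC, which encodes to "fc" in Latin1 and "c3bc" in UTF-8.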

You should not use the QDebug class to output anything outside of ASCII. You have no guarantee about what you will get.

Test program:

#include <QtCore>

// Print a label for the raw input next to the code units QString ended up holding.
void dbg(char const * rawInput, QString s) {

    QString codepoints;
    foreach(QChar chr, s) {
        codepoints.append(QString::number(chr.unicode(), 16)).append(" ");
    }

    qDebug() << "Input: " << rawInput
             << ", "
             << "Unicode codepoints: " << codepoints;
}

int main(int argc, char *argv[])
{
    QCoreApplication app(argc, argv);

    qDebug() << "system name:"
             << QLocale::system().name();

    // Iteration 5 matches no case below, so it repeats the three tests
    // with the codec settings left over from case 4 and prints no heading.
    for (int i = 1; i <= 5; ++i) {

        switch(i) {

        case 1:
            qDebug() << "\nWithout codecForCStrings (default is Latin1)\n";
            break;
        case 2:
            qDebug() << "\nWith codecForCStrings set to UTF-8\n";
            QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));
            break;
        case 3:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to UTF-8\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("UTF-8"));
            break;
        case 4:
            qDebug() << "\nWithout codecForCStrings (default is Latin1), with codecForLocale set to Latin1\n";
            QTextCodec::setCodecForCStrings(0);
            QTextCodec::setCodecForLocale(QTextCodec::codecForName("Latin1"));
            break;
        }

        qDebug() << "codecForCStrings:" << (QTextCodec::codecForCStrings()
                                            ? QTextCodec::codecForCStrings()->name()
                                            : "NOT SET");
        qDebug() << "codecForLocale:" << (QTextCodec::codecForLocale()
                                          ? QTextCodec::codecForLocale()->name()
                                          : "NOT SET");

        // Each test passes the same character three ways: as a \u escape,
        // as explicit UTF-8 bytes, and typed directly into the source file.
        qDebug() << "\n1. Using QString::QString(char const *)";
        dbg("\\u00fc", QString("\u00fc"));
        dbg("\\xc3\\xbc", QString("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString("ü"));

        qDebug() << "\n2. Using QString::fromUtf8(char const *)";
        dbg("\\u00fc", QString::fromUtf8("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromUtf8("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromUtf8("ü"));

        qDebug() << "\n3. Using QString::fromLocal8Bit(char const *)";
        dbg("\\u00fc", QString::fromLocal8Bit("\u00fc"));
        dbg("\\xc3\\xbc", QString::fromLocal8Bit("\xc3\xbc"));
        dbg("LATIN SMALL LETTER U WITH DIAERESIS", QString::fromLocal8Bit("ü"));
    }

    return app.exec();
}

Output on Windows XP with mingw 4.4.0:

system name: "pl_PL"

Without codecForCStrings (default is Latin1)

codecForCStrings: "NOT SET"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

With codecForCStrings set to UTF-8

codecForCStrings: "UTF-8"
codecForLocale: "System"

1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "102 13d "
Input: \xc3\xbc , Unicode codepoints: "102 13d "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

Without codecForCStrings (default is Latin1), with codecForLocale set to UTF-8

codecForCStrings: "NOT SET"
codecForLocale: "UTF-8"

1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

Without codecForCStrings (default is Latin1), with codecForLocale set to Latin1

codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "
codecForCStrings: "NOT SET"
codecForLocale: "ISO-8859-1"

1. Using QString::QString(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

2. Using QString::fromUtf8(char const *)
Input: \u00fc , Unicode codepoints: "fc "
Input: \xc3\xbc , Unicode codepoints: "fc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fffd "

3. Using QString::fromLocal8Bit(char const *)
Input: \u00fc , Unicode codepoints: "c3 bc "
Input: \xc3\xbc , Unicode codepoints: "c3 bc "
Input: LATIN SMALL LETTER U WITH DIAERESIS , Unicode codepoints: "fc "

I would like to thank thiago, cbreak, peppe and heinz from the #qt IRC channel on freenode.org for showing me the issues involved here and helping me understand them.

Regarding c++ - Convert QString into QByteArray with either UTF-8 or Latin1 encoding, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5288959/
