c++ - How do I write an MBCS file from a UNICODE application?

Reposted. Author: 太空宇宙. Last updated: 2023-11-04 16:19:47

My question seems to have confused people. Here is something concrete:

Our code does the following:

FILE * fout = _tfsopen(_T("丸穴種類.txt"), _T("w"), _SH_DENYNO);
_fputts(W2T(L"刃物種類\n"), fout);
fclose(fout);

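Under the MBCS build target, the code above produces a correctly encoded file for code page 932 (assuming 932 is the system default code page at run time).

Under the UNICODE build target, the code above produces a garbage file full of ????.

I want to define a symbol, or use a compiler switch, or include a special header, or link against a given library, so that the above keeps working when the build target is UNICODE, without changing the source code.

Here is the question as it used to stand: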

FILE* streams can be opened in t(ranslated) or b(inary) modes. Desktop applications can be compiled for UNICODE or MBCS (under Windows).

If my application is compiled for MBCS, then writing MBCS strings to a "wt" stream results in a well-formed text file containing MBCS text for the system code page (i.e. the code page "for non Unicode software").

Because our software generally uses the _t versions of most string & stream functions, in MBCS builds output is handled primarily by puts(pszMBString) or something similar (putc, etc.). Since pszMBString is already in the system code page (e.g. 932 when running on a Japanese machine), the string is written out verbatim (although line terminators are massaged by puts and gets automatically).

However, if my application is compiled for UNICODE, then writing MBCS strings to a "wt" stream results in garbage (lots of "?????" characters). That is what I get when I convert the UNICODE string to the system's default code page and then write it to the stream using, for example, fwrite(pszNarrow, 1, length, stream).


I can open my streams in binary mode, in which case I'll get the correct MBCS text... but the line terminators will no longer be PC-style CR+LF; they will be UNIX-style LF only. This is because in binary (non-translated) mode, the file stream doesn't handle the LF->CR+LF translation.


But what I really need is to be able to produce the exact same files I used to be able to produce when compiling for MBCS: correct line terminators and MBCS text files using the system's code page.

Obviously I can manually adjust the line terminators myself and use binary streams. However, this is a very invasive approach, as I now have to find every bit of code throughout the system that writes text files, and alter it so that it does all of this correctly. What blows my mind is that the UNICODE target is stupider / less capable than the MBCS target we used to use! Surely there is a way to toggle the C library to say "output narrow strings as-is but handle line terminators properly, exactly as you'd do in MBCS builds"?!

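Best answer

Sadly, this is a huge topic that deserves a small book of its own. That book would basically need a dedicated chapter for each target platform one wishes to build for (Linux, Windows [flavor], Mac, etc.).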

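My answer covers only Windows desktop applications, compiled for C++, with or without MFC. Please note: this pertains to reading and writing MBCS (narrow) files from a UNICODE build, using the system default code page (i.e. the code page for non-Unicode software). If you want to read and write Unicode files from a UNICODE build, you must open the files in binary mode and handle the BOM and line-feed conversions manually. On input, you must skip the BOM (if any), convert the external encoding to Windows Unicode (i.e. UTF-16LE), and convert any CR+LF sequences to LF only. On output, you must write the BOM (if any), convert UTF-16LE to whatever target encoding you want, and convert LF to CR+LF sequences so that the result is a properly formatted PC text file.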
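As a rough illustration of the output half of that Unicode case, here is a minimal sketch. It assumes the target encoding is UTF-16LE itself (so no re-encoding step is needed), that a BOM is wanted, and that the input uses bare LF line endings. The name encode_utf16le_text is hypothetical and not part of any library.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the original answer): serializes UTF-16 text
// as UTF-16LE bytes, with a leading BOM and every LF expanded to CR+LF.
// Assumes the input uses bare LF line endings.
std::string encode_utf16le_text(const std::u16string& text) {
    std::string out;
    auto put_unit = [&out](char16_t unit) {
        out.push_back(static_cast<char>(unit & 0xFF));         // low byte first
        out.push_back(static_cast<char>((unit >> 8) & 0xFF));  // then high byte
    };
    put_unit(0xFEFF);  // BOM U+FEFF, which serializes as the bytes FF FE
    for (char16_t unit : text) {
        if (unit == u'\n') put_unit(u'\r');  // translate per code unit, not per byte
        put_unit(unit);
    }
    assert(out.size() % 2 == 0);  // UTF-16 output is always a whole number of units
    return out;
}
```

The resulting bytes would then be written with fwrite to a stream opened in binary mode ("wb"), so that the C library does not touch them again.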

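Beware of MS's standard C library puts, gets, fwrite and so on: if the stream is opened in text/translated mode, they convert every 0x0A (LF) to a 0x0D 0x0A (CR+LF) sequence on write, and vice versa on read, regardless of whether you are reading or writing a single byte, a wide character, or a stream of random binary data. They don't care: all of these functions boil down to blind byte conversions in text/translated mode!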
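To make the "blind byte conversion" point concrete, here is a small model of the write-side translation. This is my own simulation under the assumption that the translation works exactly as described above; it is not the C runtime's code, and text_mode_translate is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Hypothetical model (not the CRT's implementation) of what a text-mode
// stream does on write: every 0x0A byte gets a 0x0D inserted before it,
// with no knowledge of what the bytes mean.
std::string text_mode_translate(const std::string& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (char byte : bytes) {
        if (byte == '\n') out.push_back('\r');  // inserted blindly
        out.push_back(byte);
    }
    assert(out.size() >= bytes.size());  // translation only ever adds bytes
    return out;
}
```

For narrow text, "a\nb" becomes "a\r\nb", which is what you want. But the UTF-16LE encoding of U+010A is the byte pair 0A 01, and the same function turns it into 0D 0A 01, which is no longer valid UTF-16. That is why Unicode files need binary mode and manual line-ending handling.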

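Also be aware that many Windows API functions use CP_ACP internally, with no external control over their behavior (e.g. WritePrivateProfileString()). This is a reason to make sure all libraries operate with the same character locale, namely CP_ACP and not some other one: since you cannot control the behavior of some functions, you are forced to conform to their choice or not use them.

If you use MFC, you need: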

// force CP_ACP *not* CP_THREAD_ACP for MFC CString auto-converters!!!
// this makes MFC's CString and CStdioFile and other interfaces use the
// system default code page, instead of the thread default code page (which is normally "c")
#define _CONVERSION_DONT_USE_THREAD_LOCALE

For the C++ and C libraries, you must tell the libraries to use the system code page:

// force C++ and C libraries based on setlocale() to use system locale for narrow strings
// (this automatically calls setlocale() which makes the C library do the same thing as C++ std lib)
// we only change the ctype category, not collation or date/time formatting
// (std::locale::ctype is the C++ category constant; LC_CTYPE is the C setlocale() macro)
std::locale::global(std::locale(str(boost::format(".%||") % GetACP()).c_str(), std::locale::ctype));

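I did the #define in all of my precompiled headers, before including any other headers. I set the global locale in main (or its equivalent), once for the whole program (you may need to call this for every thread that is going to do I/O or string conversions).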
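If you would rather not pull in boost just to build the locale name, something along these lines might do. Note that this sketch swaps in plain setlocale() instead of std::locale::global(), so it only affects the C library, not the C++ iostreams locale. The helper names are hypothetical, and I have assumed that the ".<codepage>" naming form is what the runtime expects, as in the snippet above.

```cpp
#include <cassert>
#include <clocale>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: builds the ".<codepage>" locale name, e.g. ".932".
std::string code_page_locale_name(unsigned int code_page) {
    assert(code_page != 0);  // 0 is the CP_ACP placeholder, not a concrete code page
    return "." + std::to_string(code_page);
}

// Hypothetical helper: switches only LC_CTYPE to the given code page.
// Returns false if the C runtime rejects the locale name.
bool use_code_page_for_ctype(unsigned int code_page) {
    const std::string name = code_page_locale_name(code_page);
    return std::setlocale(LC_CTYPE, name.c_str()) != nullptr;
}

#ifdef _WIN32
// Windows only: resolve the system default code page and apply it.
bool use_system_code_page_for_ctype() {
    return use_code_page_for_ctype(GetACP());
}
#endif
```

On Windows you would call use_system_code_page_for_ctype() once near the top of main, and check the return value rather than assume the locale was accepted.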

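The build target is UNICODE, and for most of our I/O we use explicit string conversions, via CStringA(my_wide_string), before outputting.

Another thing to be aware of is that there are two different sets of multibyte functions in the C standard library under VS C++: one set operates using the thread's locale, and the other uses the code page set with _setmbcp() (which you can query via _getmbcp()). This is the actual code page (not a locale) used for all narrow string interpretation (note: it is always initialized to CP_ACP, i.e. GetACP(), by the VS C++ startup code).

Useful references: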
- the-secret-family-split-in-windows-code-page-functions
- Sorting it all out (explains that there are four different locales in effect in Windows)
- MS offers some functions that allow you to set the encoding to use directly, but I didn't explore them
- An important note about a change to MFC that caused it to no longer respect CP_ACP, but rather CP_THREAD_ACP by default starting in MFC 7.0
- Exploration of why console apps in Windows are extreme FAIL when it comes to Unicode I/O
- MFC/ATL narrow/wide string conversion macros (which I don't use, but you may find useful)
- Byte order marker, which you need to write out for Unicode files of any encoding to be understood by other Windows software

Regarding "c++ - How do I write an MBCS file from a UNICODE application?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18359750/
