c++ - Truncated Read with UTF-16-Encoded Text in C++

Reposted by: 塔克拉玛干 | Updated: 2023-11-02 23:47:11

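My goal is to convert an external input source to a common UTF-8 internal encoding, since it is compact and compatible with many libraries I use (such as RE2). Since I do not need to do string slicing except on pure ASCII, UTF-8 is an ideal format for me. UTF-16 is one of the external input formats I need to be able to decode.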

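To test UTF-16 reading (big-endian or little-endian) in C++, I converted a UTF-8 test file to both UTF-16 LE and UTF-16 BE. The file is simple gibberish in CSV format, mixing many scripts (English, French, Japanese, Korean, Arabic, emoji, Thai) to make it reasonably complex: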

"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

UTF-8 Example

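Parsing this file, encoded as UTF-8, with the following code produces the expected output. (I understand the example is mostly artificial: my system encoding is UTF-8, so no conversion to wide characters and back to bytes is actually required.)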

#include <sstream>
#include <string>
#include <locale>
#include <iostream>
#include <fstream>
#include <codecvt>

std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf8<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-8.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When the file is compiled and run (on Linux, so the system encoding is UTF-8), I get the following output:

$ g++ utf8.cpp -o utf8 -std=c++14
$ ./utf8
73
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

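The 73 printed above is a count of wide characters, not bytes. With GCC on Linux, wchar_t is 32 bits wide, so every decoded code point (including the emoji) occupies exactly one element of the std::wstring. As a rough cross-check, the same count can be obtained from the UTF-8 bytes by skipping continuation bytes. This is a minimal sketch; the helper name count_code_points is made up for illustration, and it assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the original post): counts code points in a
// UTF-8 string by counting every byte that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
// Assumes the input is well-formed UTF-8; no validation is performed.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

Calling it on the raw bytes of the test file should match read.size() from the program above on platforms where wchar_t holds a full code point.
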
UTF-16 Example

However, when I try a similar example with UTF-16, the file is read back truncated, even though it loads correctly in text editors, Python, and so on.

#include <fstream>
#include <sstream>
#include <iostream>
#include <locale>
#include <codecvt>
#include <string>


std::wstring readFile(const char* filename)
{
    std::wifstream wif(filename, std::ios::binary);
    wif.imbue(std::locale(wif.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff>));
    std::wstringstream wss;
    wss << wif.rdbuf();
    return wss.str();
}


int main()
{
    std::wstring read = readFile("utf-16.csv");
    std::cout << read.size() << std::endl;

    using convert_type = std::codecvt_utf8<wchar_t>;
    std::wstring_convert<convert_type, wchar_t> converter;
    std::string converted_str = converter.to_bytes( read );
    std::cout << converted_str;

    return 0;
}

When the file is compiled and run (on Linux, so the system encoding is UTF-8), I get the following output for the little-endian file:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","PO

For the big-endian file, I get the following:

$ g++ utf16.cpp -o utf16 -std=c++14
$ ./utf16
19
"This","OP

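Interestingly, the CJK characters should be part of the Basic Multilingual Plane, yet they are clearly not converted properly, and the file is truncated early. The same problem occurs with a line-by-line approach.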
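
To make the Basic Multilingual Plane remark concrete: a BMP code point such as 佐 (U+4F50) fits in a single 16-bit UTF-16 code unit, while the emoji 🛂 (U+1F6C2) lies above U+FFFF and needs a surrogate pair. The sketch below shows the standard encoding arithmetic; the helper name to_utf16_units is hypothetical and not part of the original post.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encodes one Unicode code point as UTF-16 code units.
// Code points below U+10000 (the BMP) take one unit; everything above takes
// a high/low surrogate pair.
std::vector<std::uint16_t> to_utf16_units(std::uint32_t code_point)
{
    // Unicode ends at U+10FFFF, and surrogate values are not code points.
    assert(code_point <= 0x10FFFF);
    assert(code_point < 0xD800 || code_point > 0xDFFF);

    if (code_point < 0x10000) {
        return { static_cast<std::uint16_t>(code_point) };
    }
    // Split the remaining 20 bits into two 10-bit halves.
    const std::uint32_t offset = code_point - 0x10000;
    const std::uint16_t high =
        static_cast<std::uint16_t>(0xD800 + (offset >> 10));
    const std::uint16_t low =
        static_cast<std::uint16_t>(0xDC00 + (offset & 0x3FF));
    return { high, low };
}
```

For example, to_utf16_units(0x4F50) yields the single unit 0x4F50, and to_utf16_units(0x1F6C2) yields 0xD83D followed by 0xDEC2.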

Other Resources

I have already looked at the following resources, most notably this answer, as well as this answer. None of their solutions worked for me.

Other Details

LANG = en_US.UTF-8
gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2)

I am happy to provide any other details. Thank you.

Edits

Adrian mentioned in the comments that I should provide a hexdump. Here it is for "utf-16le", the little-endian UTF-16 encoded file:

0000000 0022 0054 0068 0069 0073 0022 002c 0022
0000010 4f50 85e4 0020 5e79 592b 0022 002c 0022
0000020 004d 00ea 006d 0065 0073 0022 002c 0022
0000030 ce5c ad6c 0022 000a 0022 0e20 0e04 0e27
0000040 0e32 0022 002c 0022 0020 0643 064a 0628
0000050 0648 0631 062f 0020 0644 0644 0643 062a
0000060 0627 0628 0629 0020 0628 0627 0644 0639
0000070 0631 0628 064a 0022 002c 0022 30a6 30a5
0000080 30ad 30e5 002c 0022 002c 0022 d83d dec2
0000090 0022 000a                              
0000094
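
For reference, the dump can be decoded by hand. hexdump's default format prints 16-bit words in host byte order, so on a little-endian machine the word 4f50 is stored in the file as the bytes 50 4F (ASCII 'P' and 'O'), which is consistent with the "PO" seen in the truncated output. The following is a minimal sketch of a manual UTF-16 to UTF-8 decoder, not the standard library's codecvt machinery; the names append_utf8 and utf16_bytes_to_utf8 are hypothetical, and a byte order mark is neither detected nor skipped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Appends one code point to `out` as 1 to 4 UTF-8 bytes.
void append_utf8(std::string& out, std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts raw UTF-16 bytes to UTF-8.
// `little_endian` selects the byte order of the input.
std::string utf16_bytes_to_utf8(const std::string& bytes, bool little_endian)
{
    if (bytes.size() % 2 != 0) {
        throw std::runtime_error("UTF-16 input must have an even byte count");
    }
    // Reads the 16-bit code unit that starts at byte offset i.
    auto unit_at = [&](std::size_t i) -> std::uint32_t {
        const std::uint32_t b0 = static_cast<unsigned char>(bytes[i]);
        const std::uint32_t b1 = static_cast<unsigned char>(bytes[i + 1]);
        return little_endian ? (b1 << 8) | b0 : (b0 << 8) | b1;
    };

    std::string out;
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        std::uint32_t cp = unit_at(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {          // high surrogate
            if (i + 3 >= bytes.size()) {
                throw std::runtime_error("truncated surrogate pair");
            }
            const std::uint32_t low = unit_at(i + 2);
            if (low < 0xDC00 || low > 0xDFFF) {
                throw std::runtime_error("unpaired high surrogate");
            }
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 2;                                   // low unit consumed
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::runtime_error("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Feeding it the file's bytes with little_endian set to true should reproduce the UTF-8 original, including the surrogate pair d83d dec2 near the end of the dump.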

qexyn suggested removing the std::ios::binary flag. I tried that, but it changed nothing.

Finally, I tried iconv to check whether the files are valid, using both the command-line utility and the C module.

$ iconv -f="UTF-16BE" -t="UTF-8" utf-16be.csv
"This","佐藤 幹夫","Mêmes","친구"
"ภควา"," كيبورد للكتابة بالعربي","ウゥキュ,","🛂"

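Clearly, iconv has no problem with the source files. This is pushing me toward iconv, since it is cross-platform, easy to use, and well tested, but if anyone has an answer that uses the standard library, I will gladly accept it.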
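
As an illustration of what the "C module" route can look like, here is a minimal sketch of a one-shot conversion through the POSIX iconv API. The function name convert_encoding and the 256-byte chunk size are assumptions made for this example; it targets the glibc signature of iconv (non-const char** input) and does not flush shift state, which stateless encodings such as UTF-8 and UTF-16 do not need.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <iconv.h>

// Hypothetical helper: converts `input` from encoding `from` to encoding
// `to`. Throws std::runtime_error on unsupported encodings or bad input.
std::string convert_encoding(const std::string& input,
                             const char* from, const char* to)
{
    assert(from != nullptr && to != nullptr);

    // Note the argument order: target encoding first, then source.
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1) {
        throw std::runtime_error("iconv_open: unsupported conversion");
    }

    std::string output;
    char buffer[256];
    // iconv takes a non-const char**, but does not modify the input bytes.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = buffer;
        std::size_t out_left = sizeof(buffer);
        const std::size_t rc =
            iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        const int err = errno;  // save before any other call can change it
        output.append(buffer, sizeof(buffer) - out_left);
        // E2BIG only means the chunk buffer filled up; loop and continue.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv: invalid or incomplete input");
        }
    }
    iconv_close(cd);
    return output;
}
```

A call such as convert_encoding(bytes, "UTF-16LE", "UTF-8") mirrors the command-line invocation above, with the encoding names passed straight through to iconv_open.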

Best Answer

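I am still waiting for a potential answer that uses the C++ standard library, but I have had no success with it myself, so I wrote an implementation built on Boost and iconv (both fairly common dependencies). It consists of one header and one source file, works for all of the cases above, performs reasonably well, accepts any pair of encodings that iconv supports, and wraps a stream object so it integrates easily into existing code. I am fairly new to C++, so test the code if you choose to use it yourself: I am far from an expert.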

encoding.hpp

#pragma once

#include <iostream>

#if defined(_MSC_VER) && (_MSC_VER >= 1020)
# pragma once
#endif

#include <cassert>
#include <iosfwd>            // streamsize.
#include <memory>            // allocator, bad_alloc.
#include <new>
#include <string>
#include <boost/config.hpp>
#include <boost/cstdint.hpp>
#include <boost/detail/workaround.hpp>
#include <boost/iostreams/constants.hpp>
#include <boost/iostreams/detail/config/auto_link.hpp>
#include <boost/iostreams/detail/config/dyn_link.hpp>
#include <boost/iostreams/detail/config/wide_streams.hpp>
#include <boost/iostreams/detail/config/zlib.hpp>
#include <boost/iostreams/detail/ios.hpp>
#include <boost/iostreams/filter/symmetric.hpp>
#include <boost/iostreams/pipeline.hpp>
#include <boost/type_traits/is_same.hpp>
#include <boost/iostreams/filter/zlib.hpp>
#include <iconv.h>

// Must come last.
#ifdef BOOST_MSVC
#   pragma warning(push)
#   pragma warning(disable:4251 4231 4660)     // Dependencies not exported.
#endif
#include <boost/config/abi_prefix.hpp>
#undef small


namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

extern const size_t maxUnicodeWidth;

// OBJECTS
// -------


/** @brief Parameters for input and output encodings to pass to iconv.
 */
struct encoded_params {
    std::string input;
    std::string output;

    encoded_params(const std::string &input = "UTF-8",
                   const std::string &output = "UTF-8"):
        input(input),
        output(output)
    {}
};


namespace detail
{
// DETAILS
// -------


/** @brief Base class for the character set conversion filter.
 *  Contains a core process function which converts the source
 *  encoding to the destination encoding.
 */
class BOOST_IOSTREAMS_DECL encoded_base {
public:
    typedef char char_type;
protected:
    encoded_base(const encoded_params & params = encoded_params());

    ~encoded_base();

    int convert(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int copy(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end);

    int process(const char * & src_begin,
                const char * & src_end,
                char * & dest_begin,
                char * & dest_end,
                int /* flushLevel */);

public:
    int total_in();
    int total_out();


private:
    iconv_t conv;
    bool differentCharset;
};


/** @brief Template implementation for the encoded writer.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_writer_impl : public encoded_base {
public:
    encoded_writer_impl(const encoded_params &params = encoded_params());
    ~encoded_writer_impl();
    bool filter(const char*& src_begin, const char* src_end,
                char*& dest_begin, char* dest_end, bool flush);
    void close();
};


/** @brief Template implementation for the encoded reader.
 *
 *  Model of a C-style file filter for character set conversions, via
 *  iconv.
 */
template<typename Alloc = std::allocator<char> >
class encoded_reader_impl : public encoded_base {
public:
    encoded_reader_impl(const encoded_params &params = encoded_params());
    ~encoded_reader_impl();
    bool filter(const char*& begin_in, const char* end_in,
                char*& begin_out, char* end_out, bool flush);
    void close();
    bool eof() const
    {
        return eof_;
    }

private:
    bool eof_;
};



}   /* detail */

// FILTERS
// -------

/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_writer
    : symmetric_filter<detail::encoded_writer_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_writer_impl<Alloc>         impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_writer(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_in() { return this->filter().total_in(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_writer, 1)

typedef basic_encoded_writer<> encoded_writer;


/** @brief Model of InputFilter and OutputFilter implementing
 *  character set conversion via iconv.
 */
template<typename Alloc = std::allocator<char> >
struct basic_encoded_reader
    : symmetric_filter<detail::encoded_reader_impl<Alloc>, Alloc>
{
private:
    typedef detail::encoded_reader_impl<Alloc>       impl_type;
    typedef symmetric_filter<impl_type, Alloc>  base_type;
public:
    typedef typename base_type::char_type               char_type;
    typedef typename base_type::category                category;
    basic_encoded_reader(const encoded_params &params = encoded_params(),
                         int buffer_size = default_device_buffer_size);
    int total_out() { return this->filter().total_out(); }
    bool eof() { return this->filter().eof(); }
};
BOOST_IOSTREAMS_PIPABLE(basic_encoded_reader, 1)

typedef basic_encoded_reader<> encoded_reader;


namespace detail
{
// IMPLEMENTATION
// --------------


/** @brief Initialize the encoded writer with the iconv parameters.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::encoded_writer_impl(const encoded_params& p):
    encoded_base(p)
{}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
encoded_writer_impl<Alloc>::~encoded_writer_impl()
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the writer.
 */
template<typename Alloc>
bool encoded_writer_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
     char*& dest_begin, char* dest_end, bool flush)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, flush);
    return result == -1;
}


/** @brief Close the encoded writer.
 */
template<typename Alloc>
void encoded_writer_impl<Alloc>::close()
{}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::~encoded_reader_impl()
{}


/** @brief Initialize the encoded reader with the iconv parameters.
 */
template<typename Alloc>
encoded_reader_impl<Alloc>::encoded_reader_impl(const encoded_params& p):
    encoded_base(p),
    eof_(false)
{}


/** @brief Implementation of the symmetric, character set encoding filter
 *  for the reader.
 */
template<typename Alloc>
bool encoded_reader_impl<Alloc>::filter
    (const char*& src_begin, const char* src_end,
    char*& dest_begin, char* dest_end, bool /* flush */)
{
    int result = process(src_begin, src_end, dest_begin, dest_end, true);
    return result;
}


/** @brief Close the encoded reader.
 */
template<typename Alloc>
void encoded_reader_impl<Alloc>::close()
{
    // cannot re-open, not a true stream
    //eof_ = false;
    //reset(false, true);
}

}   /* detail */


/** @brief Initializer for the symmetric write filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_writer<Alloc>::basic_encoded_writer
(const encoded_params& p, int buffer_size):
    base_type(buffer_size, p)
{}


/** @brief Initializer for the symmetric read filter, which initializes
 *  the iconv base from the parameters and the buffer size.
 */
template<typename Alloc>
basic_encoded_reader<Alloc>::basic_encoded_reader(const encoded_params &p, int buffer_size):
    base_type(buffer_size, p)
{}


}   /* iostreams */
}   /* boost */

#include <boost/config/abi_suffix.hpp> // Pops abi_suffix.hpp pragmas.
#ifdef BOOST_MSVC
    # pragma warning(pop)
#endif

encoding.cpp

#include "encoding.hpp"

#include <iconv.h>

#include <algorithm>
#include <cstring>
#include <string>


namespace boost
{
namespace iostreams
{
// CONSTANTS
// ---------

// Defined at boost::iostreams scope so that it matches the extern
// declaration in encoding.hpp.
const size_t maxUnicodeWidth = 4;

namespace detail
{

// DETAILS
// -------


/** @brief Initialize the iconv converter with the source and
 *  destination encoding.
 */
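encoded_base::encoded_base(const encoded_params &params):
    conv((iconv_t)-1),
    differentCharset(params.output != params.input)
{
    if (differentCharset) {
        // iconv_open takes the target encoding first, then the source.
        conv = iconv_open(params.output.c_str(), params.input.c_str());
        if (conv == (iconv_t)-1) {
            // Unsupported encoding pair: fail loudly rather than convert
            // with an invalid descriptor.
            throw std::ios_base::failure("iconv_open failed");
        }
    }
}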


/** @brief Cleanup the iconv converter.
 */
encoded_base::~encoded_base()
{
    if (differentCharset) {
        iconv_close(conv);
    }
}


/** @brief C-style stream converter, which converts the source
 *  character array to the destination character array, calling iconv
 *  repeatedly and skipping one byte whenever it makes no progress.
 */
int encoded_base::convert(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    char *end = dest_end - maxUnicodeWidth;
    size_t srclen, dstlen;
    while (src_begin < src_end && dest_begin < end) {
        srclen = src_end - src_begin;
        dstlen = dest_end - dest_begin;
        char *pIn = const_cast<char *>(src_begin);
        iconv(conv, &pIn, &srclen, &dest_begin, &dstlen);
        if (src_begin == pIn) {
            src_begin++;
        } else {
            src_begin = pIn;
        }
    }

    return 0;
}


/** C-style stream converter, which copies source bytes to output
 *  bytes.
 */
int encoded_base::copy(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end)
{
    size_t srclen = src_end - src_begin;
    size_t dstlen = dest_end - dest_begin;
    size_t length = std::min(srclen, dstlen);

    memmove((void*) dest_begin, (void *) src_begin, length);
    src_begin += length;
    dest_begin += length;

    return 0;
}


/** @brief Processes the input stream through the stream filter.
 */
int encoded_base::process(const char * & src_begin,
                          const char * & src_end,
                          char * & dest_begin,
                          char * & dest_end,
                          int /* flushLevel */)
{
    if (differentCharset) {
        return convert(src_begin, src_end, dest_begin, dest_end);
    } else {
        return copy(src_begin, src_end, dest_begin, dest_end);
    }
}


}   /* detail */
}   /* iostreams */
}   /* boost */

Example Program

#include "encoding.hpp"

#include <boost/iostreams/filtering_streambuf.hpp>
#include <fstream>
#include <string>


int main()
{
    std::ifstream fin("utf8.csv", std::ios::binary);
    std::ofstream fout("utf16le.csv", std::ios::binary);

    // encoding
    boost::iostreams::filtering_streambuf<boost::iostreams::input> streambuf;
    streambuf.push(boost::iostreams::encoded_reader({"UTF-8", "UTF-16LE"}));
    streambuf.push(fin);
    std::istream stream(&streambuf);

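    // Copy the converted bytes verbatim. A std::getline loop would split on
    // the single byte 0x0A, which is only half of a UTF-16 newline, and
    // std::endl would append a stray one-byte '\n' to the output.
    fout << stream.rdbuf();
    fout.close();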
}

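In the example above, we write a copy of the UTF-8 encoded file as UTF-16LE: a stream buffer converts the UTF-8 text to UTF-16LE, which we then write to the output as bytes. This adds only 4 lines of (readable) code to the whole process.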

For "c++ - Truncated Read with UTF-16-Encoded Text in C++", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39441805/
