
c++ - How to parse UTF-8 using boost::spirit?

Reposted. Author: 塔克拉玛干. Updated: 2023-11-03 00:16:06

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

#define BOOST_SPIRIT_UNICODE // We'll use unicode (UTF8) all throughout

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/qi_parse.hpp>
#include <boost/spirit/include/support_standard_wide.hpp>

void parse_simple_string()
{
    namespace qi = boost::spirit::qi;
    namespace encoding = boost::spirit::unicode;
    //namespace stw = boost::spirit::standard_wide;

    typedef std::wstring::const_iterator iterator_type;

    std::vector<std::wstring> result;
    std::wstring const input = LR"(12,3","ab,cd","G,G\"GG","kkk","10,\"0","99987","PPP","你好)";

    qi::rule<iterator_type, std::wstring()> key = +(qi::unicode::char_ - qi::lit(L"\",\""));
    qi::phrase_parse(input.begin(), input.end(),
                     key % qi::lit(L"\",\""),
                     encoding::space,
                     result);

    //std::copy(result.rbegin(), result.rend(), std::ostream_iterator<std::wstring, wchar_t> (std::wcout, L"\n"));
    for(auto const &data : result) std::wcout<<data<<std::endl;
}

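I studied the post "How to use Boost Spirit to parse Chinese(unicode utf-16)?" and followed its guidance, but I cannot parse the word "你好".

The expected result is

12,3
ab,cd
G,G\"GG
kkk
10,\"0
99987
PPP
你好

but the actual result is

12,3
ab,cd
G,G\"GG
kkk
10,\"0
99987
PPP

The Chinese word "你好" fails to parse.

The OS is Windows 7 64-bit, and my editor saves the source file as UTF-8.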
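The symptom described above does not by itself show whether "你好" was dropped by the parse or lost while printing through std::wcout. One way to narrow this down is to inspect the parsed strings numerically instead of printing them as text. The sketch below assumes nothing about the cause; dump_code_units is a hypothetical helper written for this illustration, not part of Boost or the standard library.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: render every code unit of a wide string as a hex
// number, so the content can be checked without depending on how the
// console displays non-ASCII text.
// Note: wchar_t is 16 bits wide on Windows and typically 32 bits elsewhere,
// so characters outside the Basic Multilingual Plane would appear as two
// units on Windows.
std::string dump_code_units(const std::wstring& text)
{
    std::string out;
    for (wchar_t unit : text) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "0x%04lX",
                      static_cast<unsigned long>(unit));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

If the last element of result dumps as 0x4F60 0x597D (the code points of 你 and 好), the parse produced the expected string and the output step is the place to look; any other values point at the literal or the parse.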

Best answer

If your input is UTF-8, you could try the Unicode Iterators from Boost.Regex.

For example, using boost::u8_to_u32_iterator:

A Bidirectional iterator adapter that makes an underlying sequence of UTF8 characters look like a (read-only) sequence of UTF32 characters.


#include <boost/regex/pending/unicode_iterator.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/range.hpp>
#include <iterator>
#include <iostream>
#include <ostream>
#include <cstdint>
#include <vector>

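int main()
{
    using namespace boost;
    using namespace spirit::qi;
    using namespace std;

    // Assumes a pre-C++20 language mode, where a u8 literal is an array of char.
    // The array includes the terminating null, so the terminator is parsed too
    // (hence the trailing &#0; in the output).
    auto &&utf8_text=u8"你好,世界!";
    // Adapt the UTF-8 bytes so that Spirit sees a sequence of UTF-32 code points.
    u8_to_u32_iterator<const char*>
        tbegin(begin(utf8_text)), tend(end(utf8_text));

    vector<uint32_t> result;
    parse(tbegin, tend, *standard_wide::char_, result);
    // Print each code point as a decimal numeric character reference.
    for(auto &&code_point : result)
        cout << "&#" << code_point << ";";
    cout << endl;
}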

The output is:

&#20320;&#22909;&#65292;&#19990;&#30028;&#65281;&#0;
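To make the UTF-8 to UTF-32 adaptation quoted above more concrete, here is a hand-written sketch of the same conversion in plain C++. It is not how Boost implements u8_to_u32_iterator; decode_utf8 is a hypothetical helper for illustration, and it performs only minimal validation (no checks for overlong encodings or surrogate code points). Because it takes a std::string, the literal's null terminator is not part of the input, so no trailing zero code point is produced.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of Boost): decode a UTF-8 byte string into
// UTF-32 code points. Rejects invalid lead bytes, truncated sequences and
// malformed continuation bytes; does not reject overlong encodings.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> code_points;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes that follow

        if (lead < 0x80)              { cp = lead;        extra = 0; } // 0xxxxxxx
        else if ((lead >> 5) == 0x06) { cp = lead & 0x1F; extra = 1; } // 110xxxxx
        else if ((lead >> 4) == 0x0E) { cp = lead & 0x0F; extra = 2; } // 1110xxxx
        else if ((lead >> 3) == 0x1E) { cp = lead & 0x07; extra = 3; } // 11110xxx
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) // continuation bytes look like 10xxxxxx
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }

        code_points.push_back(cp);
        i += extra + 1;
    }
    return code_points;
}
```

Calling decode_utf8 on the UTF-8 bytes of "你好" (E4 BD A0 E5 A5 BD) yields the code points 20320 and 22909, which match the first two entries of the output above.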

Regarding "c++ - How to parse UTF-8 using boost::spirit?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/13679669/
