c++ - Boost.Locale and isprint

Reposted by: 塔克拉玛干. Updated: 2023-11-03 00:39:13

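I'm looking for a way to display a UTF-8 string with its unprintable/invalid characters escaped. In the ASCII days I used isprint to decide whether a character should be printed as is or escaped. With UTF-8, iterating is harder, but Boost.Locale handles that well. However, I didn't find anything in it to decide whether a character is printable, or even whether it is actually valid.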
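For reference, the ASCII-era approach described above might look like the following sketch. The name escape_ascii is hypothetical (it is not part of Boost or the standard library), and the \xNN escape format is just one possible choice. It works one byte at a time, which is exactly why it does not carry over to multi-byte UTF-8: in the default "C" locale every byte of a multi-byte sequence would be escaped.

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the original post): returns `input` with
// every byte that std::isprint rejects replaced by a \xNN escape.
std::string escape_ascii(const std::string &input) {
  std::string out;
  for (const char ch : input) {
    // Convert to unsigned char first: passing a negative value other than
    // EOF to std::isprint is undefined behaviour.
    const unsigned char byte = static_cast<unsigned char>(ch);
    if (std::isprint(byte)) {
      out += ch;
    } else {
      char buffer[5]; // backslash, 'x', two hex digits, terminator
      std::snprintf(buffer, sizeof(buffer), "\\x%02x",
                    static_cast<unsigned>(byte));
      out += buffer;
    }
  }
  return out;
}
```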

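In the source below, the string "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff " contains some unprintable bad guys (\b, for example) and others that are plain invalid sequences (\xff\xff\xff). What test should I perform to decide whether a character is printable?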

// Based on an example of Boost.Locale.
#include <boost/locale.hpp>
#include <cstdint>  // uint8_t
#include <iostream>
#include <iomanip>
#include <string>

int main()
{
  using namespace boost::locale;
  using namespace std;

  generator gen;
  std::locale loc = gen("");
  locale::global(loc); 
  cout.imbue(loc);

  string text = "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff ";

  cout << text << endl;

  boundary::ssegment_index index(boundary::character, text.begin(), text.end());

  for (auto p: index)
    {
      cout << '['  << p << '|';
      for (uint8_t c: p)
        cout << std::hex << std::setw(2) << std::setfill('0') << int(c);
      cout << "] ";
    }
  cout << '\n';
}

When run, it gives:

[H|48] [e|65] [l|6c] [l|6c] [o|6f] [ |20] [あ|e38182] [に|e381ab] [ま|e381be]
[ |20] [➦|e29ea6] [ |20] [👙|f09f9199] [ |20] [𝕫|f09d95ab]
[⊆|e28a86] [𝕢|f09d95a2] [ |20] [|02] [|01] |08] [ |20] [??? |ffffff20]

Thanks.

Best answer

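Unicode has properties for each code point, which include a general category, and a technical report lists regex classifications (alpha, graph, etc.). The Unicode print classification includes tabs, whereas std::isprint (using the C locale) does not. print does include letters, marks, numbers, punctuation, symbols, spaces, and format code points. Format code points do not include CR or LF, but do include code points that affect the appearance of adjacent characters. I believe this is exactly what you want (except for tabs); the specification was carefully designed to support these character properties.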
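The tab difference is easy to check on the std::isprint side. The sketch below uses a hypothetical wrapper, byte_is_printable, and assumes the program has not called setlocale, so the default "C" locale is in effect.

```cpp
#include <cctype>

// Hypothetical wrapper: std::isprint takes an int that must be representable
// as unsigned char (or be EOF), so convert before calling.
bool byte_is_printable(char ch) {
  return std::isprint(static_cast<unsigned char>(ch)) != 0;
}
```

In the "C" locale, byte_is_printable('\t') is false while byte_is_printable(' ') is true, so a tab needs its own special case if the Unicode print classification is meant to behave like std::isprint.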

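Most classification functions, like std::isprint, can only be given one scalar value at a time, so UTF32 is the obvious encoding choice. Unfortunately, there is no guarantee that your system supports a UTF32 locale, nor that wchar_t has the 21 bits needed to hold every Unicode code point. Therefore, I would consider using boost::spirit::char_encoding::unicode for the classification if you can. It has an internal table of all the Unicode categories, and implements the classifications listed in the regex technical report. It looks like it uses the older Unicode 5.2 database, but the C++ used to generate the tables is provided and can be applied to newer files.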
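The width requirement can be checked at compile time. The following is a minimal sketch with hypothetical helper names (bits_needed, can_hold_any_code_point); it only relies on the highest code point being U+10FFFF, which needs 21 bits.

```cpp
#include <cstdint>
#include <limits>

// Highest Unicode code point.
constexpr std::uint32_t max_unicode_code_point = 0x10FFFF;

// Hypothetical helper: number of bits needed to represent `value`.
constexpr unsigned bits_needed(std::uint32_t value) {
  return value == 0 ? 0 : 1 + bits_needed(value >> 1);
}

// Hypothetical helper: true when the largest value of T is at least U+10FFFF,
// so every code point fits into a single T.
template <typename T>
constexpr bool can_hold_any_code_point() {
  return static_cast<unsigned long long>(std::numeric_limits<T>::max()) >=
         max_unicode_code_point;
}

static_assert(bits_needed(max_unicode_code_point) == 21,
              "U+10FFFF needs 21 bits");
static_assert(can_hold_any_code_point<char32_t>(), "char32_t is wide enough");
```

Whether can_hold_any_code_point<wchar_t>() holds depends on the platform, which is the portability concern raised above; char32_t is the safer choice for storing a decoded code point.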

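The multi-byte UTF8 sequences still need to be converted to individual code points (UTF32), and you specifically mentioned the ability to skip past invalid UTF8 sequences. Since I am a C++ programmer, I decided to unnecessarily spam your screen and implement a constexpr UTF8->UTF32 function: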

#include <cstdint>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <type_traits>  // std::is_same, used in the static_assert below
#include <boost/range/iterator_range.hpp>
#include <boost/spirit/home/support/char_encoding/unicode.hpp>

namespace {
struct multi_byte_info {
  std::uint8_t id_mask;
  std::uint8_t id_matcher;
  std::uint8_t data_mask;
};

constexpr const std::uint8_t multi_byte_id_mask = 0xC0;
constexpr const std::uint8_t multi_byte_id_matcher = 0x80;
constexpr const std::uint8_t multi_byte_data_mask = 0x3F;
constexpr const std::uint8_t multi_byte_bits = 6;
constexpr const multi_byte_info multi_byte_infos[] = {
    // skip 1 byte info
    {0xE0, 0xC0, 0x1F},
    {0xF0, 0xE0, 0x0F},
    {0xF8, 0xF0, 0x07}};
constexpr const unsigned max_length =
    (sizeof(multi_byte_infos) / sizeof(multi_byte_info));

constexpr const std::uint32_t overlong[] = {0x80, 0x800, 0x10000};
constexpr const std::uint32_t max_code_point = 0x10FFFF;
}

enum class extraction : std::uint8_t { success, failure };

struct extraction_attempt {
  std::uint32_t code_point;
  std::uint8_t bytes_processed;
  extraction status;
};

template <typename Iterator>
constexpr extraction_attempt next_code_point(Iterator position,
                                             const Iterator &end) {
  static_assert(
      std::is_same<typename std::iterator_traits<Iterator>::iterator_category,
                   std::random_access_iterator_tag>{},
      "bad iterator type");

  extraction_attempt result{0, 0, extraction::failure};

  if (end - position) {
    result.code_point = std::uint8_t(*position);
    ++position;
    ++result.bytes_processed;

    if (0x7F < result.code_point) {
      unsigned expected_length = 1;

      for (const auto info : multi_byte_infos) {
        if ((result.code_point & info.id_mask) == info.id_matcher) {
          result.code_point &= info.data_mask;
          break;
        }
        ++expected_length;
      }

      if (max_length < expected_length || (end - position) < expected_length) {
        return result;
      }

      for (unsigned byte = 0; byte < expected_length; ++byte) {
        const std::uint8_t next_byte = *(position + byte);
        if ((next_byte & multi_byte_id_mask) != multi_byte_id_matcher) {
          return result;
        }

        result.code_point <<= multi_byte_bits;
        result.code_point |= (next_byte & multi_byte_data_mask);
        ++result.bytes_processed;
      }

      if (max_code_point < result.code_point) {
        return result;
      }

      if (overlong[expected_length - 1] > result.code_point) {
        return result;
      }
    }

    result.status = extraction::success;
  } // end multi-byte processing

  return result;
}

template <typename Range>
constexpr extraction_attempt next_code_point(const Range &range) {
  return next_code_point(std::begin(range), std::end(range));
}

template <typename T>
boost::iterator_range<T>
next_character_bytes(const boost::iterator_range<T> &range,
                     const extraction_attempt result) {
  return boost::make_iterator_range(range.begin(),
                                    range.begin() + result.bytes_processed);
}

template <std::size_t Length>
constexpr bool test(const char (&range)[Length],
                    const extraction expected_status,
                    const std::uint32_t expected_code_point,
                    const std::uint8_t expected_bytes_processed) {
  const extraction_attempt result =
      next_code_point(std::begin(range), std::end(range) - 1);
  switch (expected_status) {
  case extraction::success:
    return result.status == extraction::success &&
           result.bytes_processed == expected_bytes_processed &&
           result.code_point == expected_code_point;
  case extraction::failure:
    return result.status == extraction::failure &&
           result.bytes_processed == expected_bytes_processed;
  default:
    return false;
  }
}

int main() {
  static_assert(test("F", extraction::success, 'F', 1), "");
  static_assert(test("\0", extraction::success, 0, 1), "");
  static_assert(test("\x7F", extraction::success, 0x7F, 1), "");
  static_assert(test("\xFF\xFF", extraction::failure, 0, 1), "");

  static_assert(test("\xDF", extraction::failure, 0, 1), "");
  static_assert(test("\xDF\xFF", extraction::failure, 0, 1), "");
  static_assert(test("\xC1\xBF", extraction::failure, 0, 2), "");
  static_assert(test("\xC2\x80", extraction::success, 0x80, 2), "");
  static_assert(test("\xDF\xBF", extraction::success, 0x07FF, 2), "");

  static_assert(test("\xEF\xBF", extraction::failure, 0, 1), "");
  static_assert(test("\xEF\xBF\xFF", extraction::failure, 0, 2), "");
  static_assert(test("\xE0\x9F\xBF", extraction::failure, 0, 3), "");
  static_assert(test("\xE0\xA0\x80", extraction::success, 0x800, 3), "");
  static_assert(test("\xEF\xBF\xBF", extraction::success, 0xFFFF, 3), "");

  static_assert(test("\xF7\xBF\xBF", extraction::failure, 0, 1), "");
  static_assert(test("\xF7\xBF\xBF\xFF", extraction::failure, 0, 3), "");
  static_assert(test("\xF0\x8F\xBF\xBF", extraction::failure, 0, 4), "");
  static_assert(test("\xF0\x90\x80\x80", extraction::success, 0x10000, 4), "");
  static_assert(test("\xF4\x8F\xBF\xBF", extraction::success, 0x10FFFF, 4), "");
  static_assert(test("\xF7\xBF\xBF\xBF", extraction::failure, 0, 4), "");

  static_assert(test("𝕫", extraction::success, 0x1D56B, 4), "");

  constexpr const static char text[] =
      "Hello あにま ➦ 👙 𝕫⊆𝕢 \x02\x01\b \xff\xff\xff ";

  std::cout << text << std::endl;

  auto data = boost::make_iterator_range(text);
  while (!data.empty()) {
    const extraction_attempt result = next_code_point(data);
    switch (result.status) {
    case extraction::success:
      if (boost::spirit::char_encoding::unicode::isprint(result.code_point)) {
        std::cout << next_character_bytes(data, result);
        break;
      }

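    // Unprintable code points fall through from the success case.
    default:
    case extraction::failure:
      std::cout << "[";
      for (const auto byte : next_character_bytes(data, result)) {
        // std::setw only affects the next insertion, so apply it per byte.
        std::cout << std::hex << std::setw(2) << std::setfill('0')
                  << int(std::uint8_t(byte));
      }
      std::cout << "]";
      break;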
    }

    data.advance_begin(result.bytes_processed);
  }

  return 0;
}

Output:

Hello あにま ➦ 👙 𝕫⊆𝕢  ��� 
Hello あにま ➦ 👙 𝕫⊆𝕢 [02][01][08] [ff][ff][ff] [00]
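One case the static_asserts above do not exercise is the UTF-16 surrogate range. From reading next_code_point, a sequence such as \xED\xA0\x80 appears to decode to U+D800 and report success, although RFC 3629 does not allow surrogates in UTF-8. If that matters for your input, an extra check on result.code_point could be sketched as follows; is_surrogate and is_scalar_value are hypothetical names, not part of the code above.

```cpp
#include <cstdint>

// Hypothetical helper: true for the surrogate range U+D800..U+DFFF, which is
// reserved for UTF-16 and must not appear in well-formed UTF-8.
constexpr bool is_surrogate(std::uint32_t code_point) {
  return 0xD800 <= code_point && code_point <= 0xDFFF;
}

// Hypothetical helper: true for a Unicode scalar value, i.e. any code point
// up to U+10FFFF that is not a surrogate.
constexpr bool is_scalar_value(std::uint32_t code_point) {
  return code_point <= 0x10FFFF && !is_surrogate(code_point);
}

static_assert(!is_scalar_value(0xD800), "surrogates are rejected");
static_assert(is_scalar_value(0x1D56B), "a code point from the sample text");
```

A caller could treat a successful extraction whose code point fails is_scalar_value the same way as a failed extraction and print the bytes escaped.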

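If my UTF8->UTF32 implementation scares you, or if you need support for the user's locale:

std::mbrtoc32
- The most obvious choice, but at the time of writing not yet implemented in libstdc++ or libc++ (maybe in trunk builds?).
- Not re-entrant (it depends on the current locale, which can suddenly be changed elsewhere).
Iterators provided by Boost.
- They throw on an invalid sequence, which makes them unusable here (there is no way to advance past the bad sequence).
boost::locale::conv and C++11 std::codecvt
- Designed for converting whole ranges between encodings.
- You would need to output UTF32 to the console (changing the locale), or convert one character at a time to match the source bytes with the UTF32 value.
UTF8-CPP utf8::next (and the non-throwing utf8::internal::validate_next).
- IMO both inconsistently update the iterator position. If the function fails certain sanity checks, the iterator position is left at the last byte of a valid utf8 sequence that represents the bad code point. The documentation says: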

it: a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.

This says nothing about the side effects when an exception is thrown (and there certainly are some).
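For the first option in the list above, a minimal sketch of decoding one character with std::mbrtoc32 could look like this, assuming a standard library that ships <cuchar>. The struct locale_decode_result and the function decode_with_locale are hypothetical names. The decoding follows the multibyte encoding of the current C locale (LC_CTYPE), so it only handles UTF-8 when a UTF-8 locale has been selected with std::setlocale.

```cpp
#include <cstddef>
#include <cuchar>
#include <cwchar>

// Hypothetical result type for this sketch.
struct locale_decode_result {
  char32_t code_point;
  std::size_t bytes_consumed; // 0 when nothing could be decoded
  bool ok;
};

// Decodes one character from [data, data + size) with std::mbrtoc32, using
// the multibyte encoding of the current C locale.
locale_decode_result decode_with_locale(const char *data, std::size_t size) {
  if (size == 0) {
    return {0, 0, false}; // nothing to decode
  }

  std::mbstate_t state{}; // fresh conversion state for every call
  char32_t code_point = 0;
  const std::size_t rc = std::mbrtoc32(&code_point, data, size, &state);

  // (size_t)-1: invalid sequence, (size_t)-2: incomplete sequence,
  // (size_t)-3: output produced without consuming input.
  if (rc == static_cast<std::size_t>(-1) ||
      rc == static_cast<std::size_t>(-2) ||
      rc == static_cast<std::size_t>(-3)) {
    return {0, 0, false};
  }

  // A return value of 0 means the null character was decoded (one byte).
  return {code_point, rc == 0 ? std::size_t{1} : rc, true};
}
```

On failure the caller still has to decide how many bytes to skip, since std::mbrtoc32 does not report the length of an invalid sequence; escaping a single byte and retrying from the next one is one possible policy.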

Regarding c++ - Boost.Locale and isprint, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26676977/
