c++ - How to define Unicode ranges for PEGTL by properties/identifiers in C++


Reposted by: 搜寻专家 | Updated: 2023-10-31 01:30:06


Using PEGTL (https://github.com/taocpp/PEGTL), a template-based, header-only C++11 PEG library, I can define ranges of Unicode characters like this:

utf8::range<0x0, 0x10FFFF> // all Unicode code points
utf8::ranges<0x41, 0x5A, 0x61, 0x7A> // 0x41-0x5A [A-Z] and 0x61-0x7A [a-z]
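
To make the pair notation concrete, here is a minimal sketch of how a list of inclusive (low, high) code point pairs can be tested at compile time using only C++11. The name `in_ranges` and both aliases are hypothetical and are not part of PEGTL; the sketch only mirrors the meaning of the template arguments and says nothing about how PEGTL implements its rules.

```cpp
// Hypothetical helper, not part of PEGTL: tests whether a code point lies in
// any of the inclusive [Lo, Hi] pairs passed as template arguments.
template< char32_t... Cs >
struct in_ranges;

// No pairs left: nothing matched.
template<>
struct in_ranges<>
{
   static constexpr bool test( char32_t )
   {
      return false;
   }
};

// Peel off one [Lo, Hi] pair and recurse on the remaining pairs.
// An odd number of arguments selects the undefined primary template
// and therefore fails to compile.
template< char32_t Lo, char32_t Hi, char32_t... Rest >
struct in_ranges< Lo, Hi, Rest... >
{
   static_assert( Lo <= Hi, "each range must be given as low, high" );

   static constexpr bool test( char32_t c )
   {
      return ( Lo <= c && c <= Hi ) || in_ranges< Rest... >::test( c );
   }
};

// Illustrative aliases matching the two ranges shown above.
using ascii_letter = in_ranges< 0x41, 0x5A, 0x61, 0x7A >;
using any_code_point = in_ranges< 0x0, 0x10FFFF >;
```

Because `test` is `constexpr`, a check such as `ascii_letter::test( U'Q' )` can be used inside a `static_assert`, which is the kind of compile-time availability the question asks for.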

Unicode also classifies characters by property (https://en.wikipedia.org/wiki/Unicode_character_property#General_Category), which I could use to write something like [:Lu:] or [:ID_Start:] and get a set/range of characters.

Since I am working with C++ templates, I need these ranges at compile time. As I see it, I have the following options:

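1. Discover that PEGTL itself can look up [:ID_Start:] or [:Lu:].
2. Find a C++ preprocessor library that allows this kind of query at compile time.
3. Get an application or online service where I can run these queries and obtain the ranges (as shown above), which I can then paste into my code.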

This is also my order of preference.
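
The third option can be sketched as a small generator: scan a span of code points with some predicate, merge consecutive hits into inclusive ranges, and print them as a template argument list ready to paste. The names `ranges_argument_list` and `is_ascii_letter` are hypothetical, and the predicate is only a stand-in; with ICU available, one could presumably pass a lambda that wraps `u_hasBinaryProperty( c, UCHAR_ID_START )` instead.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical generator, not part of PEGTL or ICU: scans [first, last] and
// merges consecutive code points accepted by pred into inclusive ranges,
// returned as a comma-separated "low, high, low, high" argument list.
// Assumes last <= 0x10FFFF, so that last + 1 cannot overflow.
template< typename Pred >
std::string ranges_argument_list( char32_t first, char32_t last, Pred pred )
{
   std::string args;
   bool open = false;         // currently inside a run of accepted code points?
   std::uint32_t start = 0;   // first code point of the current run
   const std::uint32_t end = last;
   // Iterate one step past `end` so that a run reaching `end` is closed too.
   for( std::uint32_t c = first; c <= end + 1; ++c ) {
      const bool hit = ( c <= end ) && pred( static_cast< char32_t >( c ) );
      if( hit && !open ) {
         start = c;
         open = true;
      }
      else if( !hit && open ) {
         char buf[ 32 ];
         std::snprintf( buf, sizeof buf, "%s0x%lX, 0x%lX",
                        args.empty() ? "" : ", ",
                        static_cast< unsigned long >( start ),
                        static_cast< unsigned long >( c - 1 ) );
         args += buf;
         open = false;
      }
   }
   return args;
}

// Stand-in predicate for illustration only: ASCII letters.
inline bool is_ascii_letter( char32_t c )
{
   return ( c >= 0x41 && c <= 0x5A ) || ( c >= 0x61 && c <= 0x7A );
}
```

Calling `ranges_argument_list( 0x0, 0x7F, is_ascii_letter )` yields the string `0x41, 0x5A, 0x61, 0x7A`, which can be wrapped in `utf8::ranges< ... >` by hand. Generated lists for real properties can contain hundreds of ranges, so this mainly trades a runtime dependency for a long template argument list.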

Best answer

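PEGTL uses rules to match characters; it does not return character sets. If you want to match characters that have certain Unicode character properties, you can create a custom rule and implement it with the help of a Unicode library such as ICU. ICU provides functions for testing code points for various properties; see the API documentation for its `unicode/uchar.h` header.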

Here is a complete example program:

#include <iomanip>
#include <iostream>

#include <unicode/uchar.h>

#include <tao/pegtl.hpp>

using namespace tao::TAO_PEGTL_NAMESPACE;  // NOLINT

namespace test
{
   template< UProperty P >
   struct icu_has_binary_property
   {
      using analyze_t = analysis::generic< analysis::rule_type::ANY >;

      template< typename Input >
      static bool match( Input& in )
      {
         // this assumes the input is UTF8, adapt as necessary
         const auto r = internal::peek_utf8::peek( in );
         // if a code point is available, the size is >0
         if( r.size != 0 ) {
            // check the property
            if( u_hasBinaryProperty( r.data, P ) ) {
               // if it matches, consume the character
               in.bump( r.size );
               return true;
            }
         }
         return false;
      }
   };

   using icu_lower = icu_has_binary_property< UCHAR_LOWERCASE >;
   using icu_upper = icu_has_binary_property< UCHAR_UPPERCASE >;

   // clang-format off
   struct grammar : seq< icu_upper, plus< icu_lower >, eof > {};
   // clang-format on
}

int main( int argc, char** argv )
{
   for( int i = 1; i < argc; ++i ) {
      argv_input<> in( argv, i );
      std::cout << argv[ i ] << " matches: " << std::boolalpha << parse< test::grammar >( in ) << std::endl;
   }
}
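
The rule above depends on `internal::peek_utf8::peek` handing back a code point (`r.data`) together with the number of bytes it occupies (`r.size`), where a size of 0 means no usable code point. The following dependency-free sketch shows roughly what such a peek step involves. It is not PEGTL's implementation; `peeked` and `peek_utf8_sketch` are hypothetical names, and for brevity the decoder does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>

// Hypothetical result type: the decoded code point and the number of bytes it
// occupied. size == 0 signals that nothing usable was found.
struct peeked
{
   char32_t data;
   std::size_t size;
};

// Minimal UTF-8 decoding sketch, NOT PEGTL's implementation.
// Decodes one code point from the first n bytes at p without consuming them.
inline peeked peek_utf8_sketch( const char* p, std::size_t n )
{
   if( n == 0 ) {
      return { 0, 0 };
   }
   const unsigned char b0 = static_cast< unsigned char >( p[ 0 ] );
   if( b0 < 0x80 ) {
      return { b0, 1 };  // single-byte (ASCII) sequence
   }
   std::size_t len = 0;
   char32_t cp = 0;
   // The lead byte determines the sequence length and the initial payload bits.
   if( ( b0 & 0xE0 ) == 0xC0 ) { len = 2; cp = b0 & 0x1F; }
   else if( ( b0 & 0xF0 ) == 0xE0 ) { len = 3; cp = b0 & 0x0F; }
   else if( ( b0 & 0xF8 ) == 0xF0 ) { len = 4; cp = b0 & 0x07; }
   else {
      return { 0, 0 };  // stray continuation byte or invalid lead byte
   }
   if( n < len ) {
      return { 0, 0 };  // truncated sequence
   }
   for( std::size_t i = 1; i < len; ++i ) {
      const unsigned char b = static_cast< unsigned char >( p[ i ] );
      if( ( b & 0xC0 ) != 0x80 ) {
         return { 0, 0 };  // not a continuation byte
      }
      cp = ( cp << 6 ) | ( b & 0x3F );  // append 6 payload bits
   }
   if( cp > 0x10FFFF ) {
      return { 0, 0 };  // beyond the Unicode code space
   }
   return { cp, len };
}
```

For the two bytes `0xC4 0x8E`, the UTF-8 encoding of "Ď" (U+010E), the sketch returns `data == 0x10E` and `size == 2`. A rule would pass that `data` to the property test and, on success, advance the input by `size` bytes.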

Now I can compile and run it:

$ g++ -std=c++11 -pedantic -Wall -Wextra -Werror -O3 -Ipegtl/include icu.cpp -licuuc -licudata -o icu
$ ./icu Ďánîel DánÎel
Ďánîel matches: true
DánÎel matches: false
$

Edit: I have added ICU rules (quite a lot of them) to PEGTL. Because they require ICU, an external dependency, I put them in the contrib section.

Source: a similar question on Stack Overflow about defining Unicode ranges for PEGTL by properties/identifiers in C++: https://stackoverflow.com/questions/48570874/
