ruby - Flex (lexer) - matching Unicode

Reposted by 数据小太阳. Updated: 2023-10-29 07:39:14

Is there a way to get flex to match Unicode?

ascSymbol     !|#|$|%|&|⋆|+|.|/|<|=|>|?|@|\|^|-|~|:
uniSymbol     \p{Symbol}|\p{Other_Symbol}|\p{Punctuation}
symbol        ascSymbol|uniSymbol{-}[^|_"',;]

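I found http://lists.gnu.org/archive/html/help-flex/2005-01/msg00043.html via "Flex(lexer) support for unicode", but I was hoping to be able to do this in an automated way.

For example, I am using cmake, which is configured to generate the lexer/parser from the *.l and *.y files at build time. Ideally, I would like a workaround that does not require installing GHC or another Haskell compiler.

I am also open to suggestions for another lexer that integrates with Bison and supports Unicode.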

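Accepted answer

It turns out that getting Unicode support in Flex is painful unless it is added to the Flex source itself. There appears to be some experimental Unicode work out there, but it never made it into any release I could find.

The Ragel documentation is quite insightful, and Ragel has built-in support for Unicode. I have since found an article with an example of how to make Ragel and C++ play nicely together. It looks like the better option, so I am going ahead with that.

Hopefully this saves others some time on this problem.

Edit

"Built-in support" above was probably an overstatement. Unicode support is much easier to get, but it does not simply work out of the box. I use cmake to generate a state machine from the derived UCD 7 files. In CMakeLists.txt I do this: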

#Ruby is required to generate a unicode Ragel machine
FIND_PACKAGE(Ruby REQUIRED)
MESSAGE("Found Ruby ${RUBY_VERSION}")
SET(UNICODE_MACHINE_PATH "${PROJECT_SOURCE_DIR}/src/unicode.rl")
if(NOT EXISTS ${UNICODE_MACHINE_PATH} OR gen_unicode)

MESSAGE("Attempting to generate unicode state machine")
EXECUTE_PROCESS(COMMAND ${RUBY_EXECUTABLE}  ${PROJECT_SOURCE_DIR}/unicode2ragel.rb
                OUTPUT_FILE ${UNICODE_MACHINE_PATH}
                RESULT_VARIABLE RAGEL_UNICODE_GEN_RES)

  if(${RAGEL_UNICODE_GEN_RES} EQUAL 0)
    MESSAGE("Generated Ragel Unicode state machine")
  else()
    MESSAGE(SEND_ERROR "Unable to generate unicode state machine")
  endif()
endif()
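
For orientation, the extracted UCD file that this build step feeds to the script is line-oriented: a code point or range, a semicolon, the general category, then a comment. The following is a minimal sketch of reading that layout, assuming it has the form `range ; category # comment`. The sample lines are illustrative stand-ins rather than data fetched from unicode.org, and `parse_ucd_line` is a hypothetical helper that is not part of the script.

```ruby
# Illustrative sample in the DerivedGeneralCategory.txt layout (assumed, not
# downloaded); the real file is much longer.
SAMPLE = <<~UCD
  # General_Category=Uppercase_Letter
  0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
  00B5          ; Ll #       MICRO SIGN
UCD

# Hypothetical helper: returns [code point range, category],
# or nil for comment-only and blank lines.
def parse_ucd_line(line)
  data = line.sub(/#.*/, '').strip       # drop the trailing comment
  return nil if data.empty?

  range, category = data.split(';').map(&:strip)
  first, last = range.split('..')
  last ||= first                          # single code point, e.g. "00B5"
  [(first.hex..last.hex), category]
end

parsed = SAMPLE.each_line.map { |l| parse_ucd_line(l) }.compact
parsed.each do |range, category|
  puts format('%s: U+%04X..U+%04X (%d code points)',
              category, range.begin, range.end, range.size)
end
```

Each parsed range is what the generator then has to translate into byte patterns for the chosen encoding.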

Then, in unicode2ragel.rb (distributed with Ragel and slightly modified for UCD 7):

#!/usr/bin/env ruby
#
# This script uses the unicode spec to generate a Ragel state machine
# that recognizes unicode characters by general category.  This modified
# version generates one character class per category, for example
# uUppercaseLetter, uLowercaseLetter, and uDecimalNumber.
# Currently supported encodings are UTF-8 [default] and UCS-4.
#
# Usage: unicode2ragel.rb [options]
#    -e, --encoding [ucs4 | utf8]     Data encoding
#    -h, --help                       Show this message
#
# This script was originally written as part of the Ferret search
# engine library.
#
# Author: Rakan El-Khalil <rakan@well.com>

require 'optparse'
require 'open-uri'

ENCODINGS = [ :utf8, :ucs4 ]
ALPHTYPES = { :utf8 => "unsigned char", :ucs4 => "unsigned int" }
CHART_URL = "http://www.unicode.org/Public/7.0.0/ucd/extracted/DerivedGeneralCategory.txt"#"http://www.unicode.org/Public/7.0.0/ucd/DerivedCoreProperties.txt"

###
# Display vars & default option

TOTAL_WIDTH = 80
RANGE_WIDTH = 23
@encoding = :utf8

###
# Option parsing

cli_opts = OptionParser.new do |opts|
  opts.on("-e", "--encoding [ucs4 | utf8]", "Data encoding") do |o|
    @encoding = o.downcase.to_sym
  end
  opts.on("-h", "--help", "Show this message") do
    puts opts
    exit
  end
end

cli_opts.parse(ARGV)
unless ENCODINGS.member? @encoding
  puts "Invalid encoding: #{@encoding}"
  puts cli_opts
  exit
end

##
# Downloads the document at url and yields every alpha line's hex
# range and description.

def each_alpha( url, property )
  # Kernel#open no longer opens URLs on Ruby 3.0+; open-uri provides URI.open.
  URI.open( url ) do |file|
    file.each_line do |line|
      next if line =~ /^#/;
      next if line !~ /; #{property} #/;

      range, description = line.split(/;/)
      range.strip!
      description.gsub!(/.*#/, '').strip!

      if range =~ /\.\./
           start, stop = range.split '..'
      else start = stop = range
      end

      yield start.hex .. stop.hex, description
    end
  end
end

###
# Formats to hex at minimum width

def to_hex( n )
  r = "%0X" % n
  r = "0#{r}" unless (r.length % 2).zero?
  r
end

###
# UCS4 is just a straight hex conversion of the unicode codepoint.

def to_ucs4( range )
  rangestr  =   "0x" + to_hex(range.begin)
  rangestr << "..0x" + to_hex(range.end) if range.begin != range.end
  [ rangestr ]
end

##
# 0x00     - 0x7f     -> 0zzzzzzz[7]
# 0x80     - 0x7ff    -> 110yyyyy[5] 10zzzzzz[6]
# 0x800    - 0xffff   -> 1110xxxx[4] 10yyyyyy[6] 10zzzzzz[6]
# 0x010000 - 0x10ffff -> 11110www[3] 10xxxxxx[6] 10yyyyyy[6] 10zzzzzz[6]

UTF8_BOUNDARIES = [0x7f, 0x7ff, 0xffff, 0x10ffff]

def to_utf8_enc( n )
  r = 0
  if n <= 0x7f
    r = n
  elsif n <= 0x7ff
    y = 0xc0 | (n >> 6)
    z = 0x80 | (n & 0x3f)
    r = y << 8 | z
  elsif n <= 0xffff
    x = 0xe0 | (n >> 12)
    y = 0x80 | (n >>  6) & 0x3f
    z = 0x80 |  n        & 0x3f
    r = x << 16 | y << 8 | z
  elsif n <= 0x10ffff
    w = 0xf0 | (n >> 18)
    x = 0x80 | (n >> 12) & 0x3f
    y = 0x80 | (n >>  6) & 0x3f
    z = 0x80 |  n        & 0x3f
    r = w << 24 | x << 16 | y << 8 | z
  end

  to_hex(r)
end

def from_utf8_enc( n )
  n = n.hex
  r = 0
  if n <= 0x7f
    r = n
  elsif n <= 0xdfff
    y = (n >> 8) & 0x1f
    z =  n       & 0x3f
    r = y << 6 | z
  elsif n <= 0xefffff
    x = (n >> 16) & 0x0f
    y = (n >>  8) & 0x3f
    z =  n        & 0x3f
    r = x << 12 | y << 6 | z
  elsif n <= 0xf7ffffff
    w = (n >> 24) & 0x07
    x = (n >> 16) & 0x3f
    y = (n >>  8) & 0x3f
    z =  n        & 0x3f
    r = w << 18 | x << 12 | y << 6 | z
  end
  r
end

###
# Given a range, splits it up into ranges that can be continuously
# encoded into utf8.  Eg: 0x00 .. 0xff => [0x00..0x7f, 0x80..0xff]
# This is not strictly needed since the current [5.1] unicode standard
# doesn't have ranges that straddle utf8 boundaries.  This is included
# for completeness as there is no telling if that will ever change.

def utf8_ranges( range )
  ranges = []
  UTF8_BOUNDARIES.each do |max|
    if range.begin <= max
      return ranges << range if range.end <= max

      ranges << (range.begin .. max)
      range = (max + 1) .. range.end
    end
  end
  ranges
end

def build_range( start, stop )
  size = start.size/2
  left = size - 1
  return [""] if size < 1

  a = start[0..1]
  b = stop[0..1]

  ###
  # Shared prefix

  if a == b
    return build_range(start[2..-1], stop[2..-1]).map do |elt|
      "0x#{a} " + elt
    end
  end

  ###
  # Unshared prefix, end of run

  return ["0x#{a}..0x#{b} "] if left.zero?

  ###
  # Unshared prefix, not end of run
  # Range can be 0x123456..0x56789A
  # Which is equivalent to:
  #     0x123456 .. 0x12FFFF
  #     0x130000 .. 0x55FFFF
  #     0x560000 .. 0x56789A

  ret = []
  ret << build_range(start, a + "FF" * left)

  ###
  # Only generate middle range if need be.

  if a.hex+1 != b.hex
    max = to_hex(b.hex - 1)
    max = "FF" if b == "FF"
    ret << "0x#{to_hex(a.hex+1)}..0x#{max} " + "0x00..0xFF " * left
  end

  ###
  # Don't generate last range if it is covered by first range

  ret << build_range(b + "00" * left, stop) unless b == "FF"
  ret.flatten!
end

def to_utf8( range )
  utf8_ranges( range ).map do |r|
    build_range to_utf8_enc(r.begin), to_utf8_enc(r.end)
  end.flatten
end

##
# Perform a 3-way comparison of the number of codepoints advertised by
# the unicode spec for the given range, the originally parsed range,
# and the resulting utf8 encoded range.

def count_codepoints( code )
  code.split(' ').inject(1) do |acc, elt|
    if elt =~ /0x(.+)\.\.0x(.+)/
      if @encoding == :utf8
        acc * (from_utf8_enc($2) - from_utf8_enc($1) + 1)
      else
        acc * ($2.hex - $1.hex + 1)
      end
    else
      acc
    end
  end
end

def is_valid?( range, desc, codes )
  spec_count  = 1
  spec_count  = $1.to_i if desc =~ /\[(\d+)\]/
  range_count = range.end - range.begin + 1

  sum = codes.inject(0) { |acc, elt| acc + count_codepoints(elt) }
  sum == spec_count and sum == range_count
end

##
# Generate the state machine to stdout

def generate_machine( name, property )
  pipe = " "
  puts "    #{name} = "
  each_alpha( CHART_URL, property ) do |range, desc|

    codes = (@encoding == :ucs4) ? to_ucs4(range) : to_utf8(range)

    raise "Invalid encoding of range #{range}: #{codes.inspect}" unless
      is_valid? range, desc, codes

    range_width = codes.map { |a| a.size }.max
    range_width = RANGE_WIDTH if range_width < RANGE_WIDTH

    desc_width  = TOTAL_WIDTH - RANGE_WIDTH - 11
    desc_width -= (range_width - RANGE_WIDTH) if range_width > RANGE_WIDTH

    if desc.size > desc_width
      desc = desc[0..desc_width - 4] + "..."
    end

    codes.each_with_index do |r, idx|
      desc = "" unless idx.zero?
      code = "%-#{range_width}s" % r
      puts "      #{pipe} #{code} ##{desc}"
      pipe = "|"
    end
  end
  puts "      ;"
  puts ""
end

puts <<EOF
# The following Ragel file was autogenerated from: #{CHART_URL}
#
# It defines one machine per unicode general category (uUppercaseLetter, ...).
#
# To use this, make sure that your alphtype is set to #{ALPHTYPES[@encoding]},
# and that your input is in #{@encoding}.

%%{
    machine WChar;
EOF
generate_machine( :uUppercaseLetter, "Lu" )
generate_machine( :uLowercaseLetter, "Ll" )
generate_machine( :uTitlecaseLetter, "Lt" )
generate_machine( :uModifierLetter, "Lm" )
generate_machine( :uOtherLetter, "Lo" )
generate_machine( :uNonspacingMark, "Mn" )
generate_machine( :uEnclosingMark, "Me" )
generate_machine( :uSpacingMark, "Mc" )
generate_machine( :uDecimalNumber, "Nd" )
generate_machine( :uLetterNumber, "Nl" )
generate_machine( :uOtherNumber, "No" )
generate_machine( :uSpaceSeparator, "Zs" )
generate_machine( :uLineSeparator, "Zl" )
generate_machine( :uParagraphSeparator, "Zp" )
generate_machine( :uFormat, "Cf" )
generate_machine( :uPrivateUse, "Co" )
generate_machine( :uSurrogate, "Cs" )
generate_machine( :uDashPunctuation, "Pd" )
generate_machine( :uOpenPunctuation, "Ps" )
generate_machine( :uClosePunctuation, "Pe" )
generate_machine( :uConnectorPunctuation, "Pc" )
generate_machine( :uOtherPunctuation, "Po" )
generate_machine( :uMathSymbol, "Sm" )
generate_machine( :uCurrencySymbol, "Sc" )
generate_machine( :uModifierSymbol, "Sk" )
generate_machine( :uOtherSymbol, "So" )
generate_machine( :uInitialPunctuation, "Pi" )
generate_machine( :uFinalPunctuation, "Pf" )
puts <<EOF
}%%
EOF

Then, in your Ragel machine file, you can include unicode.rl and access each of the defined Unicode groups, such as uUppercaseLetter and so on.
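
To make the generated definitions less opaque: in UTF-8 mode each group is a list of byte sequences, because the machine runs over bytes and not code points. The sketch below uses Ruby's built-in encoder to show the byte form of a few code points and cross-checks the three-byte layout from the script's comments by hand. The helper names `utf8_bytes`, `ragel_seq`, and `encode_3byte` are hypothetical and are not part of unicode2ragel.rb.

```ruby
# Hypothetical helpers for illustration only.

# UTF-8 bytes of one code point, using Ruby's built-in encoder.
def utf8_bytes(codepoint)
  [codepoint].pack('U').bytes
end

# Spell a code point as space-separated hex bytes, e.g. "0xC3 0xA9".
def ragel_seq(codepoint)
  utf8_bytes(codepoint).map { |b| format('0x%02X', b) }.join(' ')
end

# Manual encoding of the three-byte form 1110xxxx 10yyyyyy 10zzzzzz,
# valid for code points 0x800..0xFFFF.
def encode_3byte(n)
  [0xE0 | (n >> 12), 0x80 | ((n >> 6) & 0x3F), 0x80 | (n & 0x3F)]
end

puts ragel_seq(0x41)     # one byte
puts ragel_seq(0xE9)     # two bytes
puts ragel_seq(0x22C6)   # three bytes
puts ragel_seq(0x1F600)  # four bytes

# The hand-rolled three-byte encoding agrees with the built-in one.
puts encode_3byte(0x22C6) == utf8_bytes(0x22C6)
```

A code point range whose encodings share a lead byte collapses into a single pattern. U+00C0..U+00D6, for instance, shares the lead byte 0xC3, so the script would emit it as roughly `0xC3 0x80..0x96`.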

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/28929193/
