Table-driven lexical analyzer with character classification in Rust

Reposted by: bug小助手. Updated: 2023-10-25 16:35:20

I am writing a lexer for the C language as an exercise in Rust. So far I have succeeded in implementing a lexer using backtracking, as follows:

lex_keyword(input)
    .or_else(|| lex_identifier(input))
    .or_else(|| lex_integer(input))
    ...
    // unwrap_or_else evaluates the fallback only when every alternative failed
    .unwrap_or_else(|| lex_unexpected(input));

This works as expected, but it is a bit slow: it tries each alternative in turn until one succeeds, and every lex_ function first scans the input to check whether it is valid and only then tries to map it to a token. For example, for identifiers:

let end = input.iter().position(|byte| byte.is_ascii_whitespace() || /* is a punctuator */).unwrap_or(input.len());

// And then

if input[..end].iter().all(|byte| /* is a valid identifier member */) { /* return an identifier */ } else { None }

I have read about how lexer generators implement a lexical analyzer.

They basically convert the grammar to an NFA, then to a DFA, and run the main loop over that DFA (which is good in theory).

I have read about lex, flex, and yacc, and have seen that they do not generate just one lookup table for the DFA but several, and that they also use character classes. Their lookup tables are also much smaller than those of a naive DFA in which each row (state) has 128 columns (one per ASCII character).

My question is how to implement character classes and how to use them in the lookup table for the DFA. Alternatively, can I create different lookup tables for different rules? For example:

token ::=
  keyword
| identifier
| integer_literal
| float_literal
| ...

DFA_FOR_KEYWORD = [...]
DFA_FOR_IDENTIFIER = [...]
DFA_FOR_INTEGER = [...]
DFA_FOR_FLOAT = [...]
LEXER_DFA = [/* Which connects all the DFAs */]
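
On the idea of one table per rule: a common hand-written approach, shown here only as a sketch, avoids a dedicated keyword DFA altogether. The lexer recognizes every word as an identifier and then checks the lexeme against a keyword list. The names `Word`, `KEYWORDS`, and `classify_word` are hypothetical, and the keyword list is deliberately incomplete.

```rust
/// Result of classifying a lexeme that the identifier rule already matched.
#[derive(Debug, PartialEq, Eq)]
enum Word {
    Keyword(&'static str),
    Identifier,
}

// Hypothetical, deliberately incomplete list of C keywords.
const KEYWORDS: [&str; 5] = ["else", "if", "int", "return", "while"];

/// Looks the lexeme up in the keyword list; anything else is an identifier.
fn classify_word(lexeme: &[u8]) -> Word {
    KEYWORDS
        .iter()
        .find(|kw| kw.as_bytes() == lexeme)
        .map_or(Word::Identifier, |kw| Word::Keyword(*kw))
}
```

With this, `classify_word(b"while")` yields `Word::Keyword("while")` and `classify_word(b"whilst")` yields `Word::Identifier`, so the DFA itself needs no keyword states.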

Also, about character classes: how are they encoded, and can you reuse the same class for different parts?

For example:

Let's say that an identifier has to end with the letter 'p':

identifier ::= [a-zA-Z]*p

How many classes would I have?

- One for a-z, one for A-Z, and one for 'p'
- One for a-zA-Z, and one for 'p'
- One for a-zA-Z, and then a check that it ends with 'p'
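
One way to reason about the options above, as a sketch rather than a definitive answer: two bytes can share a class whenever no state of the DFA treats them differently. For `[a-zA-Z]*p` that gives three classes, assuming classes form a partition of the byte values: `p` alone, every other ASCII letter, and everything else. All constant and function names below are hypothetical.

```rust
// Character classes: the columns of the transition table.
const CLASS_P: usize = 0;      // the byte b'p'
const CLASS_LETTER: usize = 1; // any other ASCII letter
const CLASS_OTHER: usize = 2;  // every byte outside [a-zA-Z]
const NUM_CLASSES: usize = 3;

// DFA states: the rows of the transition table.
const NOT_P: usize = 0;  // start state, or the last byte was a letter other than 'p'
const ENDS_P: usize = 1; // the last byte was 'p' (the only accepting state)
const DEAD: usize = 2;   // a byte outside [a-zA-Z] was seen
const NUM_STATES: usize = 3;

/// TRANSITIONS[state][class] is the next state.
const TRANSITIONS: [[usize; NUM_CLASSES]; NUM_STATES] = [
    // 'p'    letter  other
    [ENDS_P, NOT_P, DEAD], // NOT_P
    [ENDS_P, NOT_P, DEAD], // ENDS_P
    [DEAD, DEAD, DEAD],    // DEAD
];

/// Maps a byte to its class; b'p' is matched before the letter ranges.
fn class_of(byte: u8) -> usize {
    match byte {
        b'p' => CLASS_P,
        b'a'..=b'z' | b'A'..=b'Z' => CLASS_LETTER,
        _ => CLASS_OTHER,
    }
}

/// Returns true when the whole input matches [a-zA-Z]*p.
fn matches_identifier(input: &[u8]) -> bool {
    let mut state = NOT_P;
    for &byte in input {
        state = TRANSITIONS[state][class_of(byte)];
    }
    state == ENDS_P
}
```

Each row has 3 columns instead of 128. Splitting a-z from A-Z would add a column without changing any transition, which is why those two ranges can share one class here.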

Another question: how do you assemble them in the main loop?

Given the transition table, the character class, and the last state, how do you know what the next transition is (in other words, how do you encode the transition table so that it moves according to the character class)?
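
A minimal sketch of such a main loop, assuming a single combined DFA for just two rules (identifiers and integers) and a 256-entry byte-to-class map. Every name below is hypothetical. The loop looks up the class of each byte, indexes `TRANSITIONS[state][class]`, and remembers the last accepting position so that the longest match wins.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TokenKind { Identifier, Integer, Unexpected }

// Character classes: the columns of the transition table.
const LETTER: u8 = 0; // [a-zA-Z_]
const DIGIT: u8 = 1;  // [0-9]
const SPACE: u8 = 2;  // space, tab, newline, carriage return
const OTHER: u8 = 3;  // everything else
const NUM_CLASSES: usize = 4;

/// Builds the byte -> class map once, at compile time.
const fn build_class_table() -> [u8; 256] {
    let mut table = [OTHER; 256];
    let mut i = 0;
    while i < 256 {
        table[i] = match i as u8 {
            b'a'..=b'z' | b'A'..=b'Z' | b'_' => LETTER,
            b'0'..=b'9' => DIGIT,
            b' ' | b'\t' | b'\n' | b'\r' => SPACE,
            _ => OTHER,
        };
        i += 1;
    }
    table
}
static CLASS_OF: [u8; 256] = build_class_table();

// DFA states: the rows of the transition table.
const START: usize = 0;
const IDENT: usize = 1;
const INT: usize = 2;
const DEAD: usize = 3;

/// TRANSITIONS[state][class] is the next state.
static TRANSITIONS: [[usize; NUM_CLASSES]; 4] = [
    // letter digit  space other
    [IDENT, INT, DEAD, DEAD],   // START
    [IDENT, IDENT, DEAD, DEAD], // IDENT
    [DEAD, INT, DEAD, DEAD],    // INT
    [DEAD, DEAD, DEAD, DEAD],   // DEAD
];

/// Token kind produced when the DFA stops in each state (None = not accepting).
static ACCEPT: [Option<TokenKind>; 4] =
    [None, Some(TokenKind::Identifier), Some(TokenKind::Integer), None];

/// Runs the DFA from the start of `input` and returns the longest match.
fn next_token(input: &[u8]) -> (TokenKind, usize) {
    let mut state = START;
    let mut last_accept = None;
    for (i, &byte) in input.iter().enumerate() {
        let class = CLASS_OF[byte as usize] as usize;
        state = TRANSITIONS[state][class];
        if state == DEAD {
            break;
        }
        if let Some(kind) = ACCEPT[state] {
            last_accept = Some((kind, i + 1));
        }
    }
    // No rule matched: report a single unexpected byte so the caller advances.
    last_accept.unwrap_or((TokenKind::Unexpected, 1))
}

/// Skips whitespace and collects (kind, lexeme) pairs.
fn tokenize(input: &[u8]) -> Vec<(TokenKind, &[u8])> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while pos < input.len() {
        if CLASS_OF[input[pos] as usize] == SPACE {
            pos += 1;
            continue;
        }
        let (kind, len) = next_token(&input[pos..]);
        tokens.push((kind, &input[pos..pos + len]));
        pos += len;
    }
    tokens
}
```

Each input byte costs two array lookups and is read once per token attempt, in contrast to the backtracking version, which rescans the same input for every rule.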

I looked at this link, https://www.cs.man.ac.uk/%5C~pjj/cs211/ho/node6.html, but couldn't understand where the character classes are used or how they are encoded.

Also, if you have any resources on this problem, please feel free to share them.

Thank you very much for your advice.

P.S.: I understand that a lexer generator does all of these things; however, I would like to implement it by hand as an exercise.

Comments

Removed the rust tag, as this question is overly broad.
