gpt4 book ai didi

Lexical Analyzer with table-driven and character classification in rust(基于表驱动和字符分类的生锈词法分析器)

转载 作者:bug小助手 更新时间:2023-10-25 16:35:20 25 4
gpt4 key购买 nike



I am writing a lexer for C language as an exercise in Rust, for now I succedded implementing a lexer using backtracking, as follows:

作为Rust的练习,我正在为C语言编写一个词法分析器,现在我成功地使用回溯实现了一个词法分析器,如下所示:


lex_keyword(input)
.or_else(|| lex_identifier(input))
.or_else(|| lex_integer(input))
...
.unwrap_or(lex_unexpected(input));

This works as expected, however it is a bit slow, because it tries every possible solution until it finds one, and every lex_function, checks the whole input if it is valid or not and then tries to map to a token, for example on identifiers:

这按预期工作,但是有点慢,因为它尝试了所有可能的解决方案,直到找到一个解决方案,并且每个lex_函数都检查整个输入是否有效,然后尝试映射到令牌,例如在标识符上:


 let end = input.iter().position(|byte| byte.is_ascii_whitespace() || /* is a punctuator */).unwrap_or(input.len());

// And then

if input[..end].iter().all(|byte| /* is valid identifier member */) { /* Return an identifer */ } else { None }

I have read about how lexer generators implement a lexer analyzer.

我读过有关词法分析器生成器如何实现词法分析器的文章。


So they basically convert the grammar to a NFA and then to a DFA and performs the main loop (which is good in theory).

因此,它们基本上将语法转换为NFA,然后转换为DFA,并执行主循环(这在理论上是好的)。


I have read about lex and flex and yacc, and have seen that they do not generate just one lookup table for the DFA, but a bunch of them and also use character classes. And also their lookup tables aren't so huge, if you would implement a DFA, where a row(state) would have 128-columns (ASCII characters).

我读过有关lex、flex和yacc的文章,看到它们不仅为DFA生成了一个查找表,而且还使用了字符类。而且,如果您要实现一个DFA,那么它们的查找表也不是那么大,其中一行(STATE)将有128列(ASCII字符)。


My questions is, how to implement character classes, and how to use them in the lookup table for the DFA, or can i create different lookup tables for different rules, for example:

我的问题是,如何实现字符类,以及如何在DFA的查找表中使用它们,或者我可以为不同的规则创建不同的查找表,例如:


token ::=
keyword
| identifeer
| integer_literal
| float_literal
| ...

DFA_FOR_KEYWORD = [...]
DFA_FOR_IDENTIFIER = [...]
DFA_FOR_INTEGER = [...]
DFA_FOR_FLOAT = [...]
LEXER_DFA = [/* Which connects all the DFAs */]

Also about character_classes, how are they encoded, and can you reuse the same class for different parts?

另外,关于CHARACTER_CLASS,它们是如何编码的,您可以为不同的部分重用相同的类吗?




For example:

例如:


Let's say that an identifier has to end with the letter 'p'

假设一个标识符必须以字母‘p’结尾


identifer ::= [a-zA-Z]*p

How many classes would I have?

我要上几节课?



  1. One for a-z, one for A-Z and one for 'p'

  2. One for a-zA-Z, and one for 'p'

  3. One for a-zA-Z, and check if it end with 'p'


Another question would be, how do you assemble them in the main loop?

另一个问题是,如何在主循环中组装它们?


Given the transition table, and given the character class and the last state, how do you now that is the next transition (in other words how to you codify the transitions table to move according to the character class)?

给定转换表,给定字符类和最后一个状态,你如何知道这是下一个转换(换句话说,你如何编码转换表,以根据字符类移动)?


I look up on this link https://www.cs.man.ac.uk/%5C~pjj/cs211/ho/node6.html, but couldn't understand where the character classes are used or how are they encodded.

我在这个链接https://www.cs.man.ac.uk/%5C~pjj/cs211/ho/node6.html,上查找,但不能理解字符类在哪里使用,或者它们是如何编码的。


Also if you have some resources regarding this problem, please feel free to share

另外,如果你有一些关于这个问题的资源,请随时分享


Thank you very much for your advice

非常感谢您的建议


P.S: I understand that a lexer generator does all of these things, however I would like to implement it by hand as a part of exercise.

附注:我知道词法分析器生成器可以做所有这些事情,但是我想作为练习的一部分手动实现它。


更多回答

Removed the rust-tag as this question is overly broad.

去掉了这个问题的铁锈标签,因为这个问题过于宽泛。

优秀答案推荐
更多回答

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com