c# - Poor man's "lexer" for C#

Reposted. Author: IT王子. Updated: 2023-10-29 04:08:06

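I'm trying to write a very simple parser in C#.

I need a lexer: something that lets me associate regular expressions with tokens, so that it reads in regular expressions and gives me back symbols.

It seems like I ought to be able to use Regex to do the actual heavy lifting, but I can't see an easy way to do it. For one thing, Regex only seems to work on strings, not on streams (why is that!?!?).

Basically, I want an implementation of the following interface: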

using System;
using System.Collections.Generic;
using System.IO;

interface ILexer : IDisposable
{
    /// <summary>
    /// Return true if there are more tokens to read
    /// </summary>
    bool HasMoreTokens { get; }

    /// <summary>
    /// The actual contents that matched the token
    /// </summary>
    string TokenContents { get; }

    /// <summary>
    /// The particular token in "tokenDefinitions" that was matched
    /// (e.g. "STRING", "NUMBER", "OPEN PARENS", "CLOSE PARENS")
    /// </summary>
    object Token { get; }

    /// <summary>
    /// Move to the next token
    /// </summary>
    void Next();
}

interface ILexerFactory
{
    /// <summary>
    /// Create a Lexer for converting a stream of characters into tokens
    /// </summary>
    /// <param name="reader">TextReader that supplies the underlying stream</param>
    /// <param name="tokenDefinitions">A dictionary from regular expressions to their "token identifiers"</param>
    /// <returns>The lexer</returns>
    ILexer CreateLexer(TextReader reader, IDictionary<string, object> tokenDefinitions);
}

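So, please send the code...
No, seriously, I am about to start writing an implementation of the above interface, yet I find it hard to believe that there isn't already some simple way of doing this in .NET (2.0).

So, any suggestions for a simple way to do the above? (Also, I don't want any "code generators". Performance is not important for this, and I don't want to introduce any complexity into the build process.)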

Best answer

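The original version I posted here as an answer had a problem: it only worked while more than one "Regex" matched the current expression. That is, as soon as only one Regex matched, it returned a token, whereas most people want the Regex to be "greedy". This was especially the case for things such as "quoted strings".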

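The only solution that sits on top of Regex is to read the input line by line (which means you cannot have tokens that span multiple lines). I can live with this; it is, after all, a poor man's lexer! Besides, it is usually useful to get line number information out of the Lexer in any case.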
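
The line-by-line idea can be sketched in isolation as follows. This is only an illustration under stated assumptions, not the answer's code: `LineScanner` and `Scan` are hypothetical names, and instead of taking substrings it relies on the `\G` anchor together with `Regex.Match(string, int)` to match at a given position within the current line.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the answer): splits each line of a
// TextReader into (line number, matched text) pairs using a single pattern.
public static class LineScanner
{
    public static List<KeyValuePair<int, string>> Scan(TextReader reader, string pattern)
    {
        // \G forces the match to begin exactly at the start position given to Match.
        var regex = new Regex(@"\G(?:" + pattern + ")");
        var result = new List<KeyValuePair<int, string>>();
        int lineNumber = 0;
        string line;

        // Regex works on strings, so the stream is consumed one line at a time.
        while ((line = reader.ReadLine()) != null)
        {
            ++lineNumber;
            int position = 0;
            while (position < line.Length)
            {
                Match m = regex.Match(line, position);
                if (!m.Success || m.Length == 0)
                    throw new FormatException(
                        string.Format("No match at line {0} position {1}", lineNumber, position));

                result.Add(new KeyValuePair<int, string>(lineNumber, m.Value));
                position += m.Length;
            }
        }
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        using (var reader = new StringReader("ab 12\ncd"))
        {
            foreach (var pair in LineScanner.Scan(reader, @"[a-z]+|\d+|\s+"))
                Console.WriteLine("line {0}: '{1}'", pair.Key, pair.Value);
        }
    }
}
```

As with any line-based scheme, a token can never include a line break here, because `ReadLine` strips it before the regex ever sees the text.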

So, here is a new version that addresses these issues. Credit also goes to this

using System;
using System.IO;
using System.Text.RegularExpressions;

public interface IMatcher
{
    /// <summary>
    /// Return the number of characters that this "regex" or equivalent
    /// matches.
    /// </summary>
    /// <param name="text">The text to be matched</param>
    /// <returns>The number of characters that matched</returns>
    int Match(string text);
}

sealed class RegexMatcher : IMatcher
{
    private readonly Regex regex;

    // Anchor the pattern at the start of the text. The non-capturing group keeps
    // the anchor applied to every alternative of a pattern such as "a|b".
    public RegexMatcher(string regex) => this.regex = new Regex(string.Format("^(?:{0})", regex));

    public int Match(string text)
    {
        var m = regex.Match(text);
        return m.Success ? m.Length : 0;
    }

    public override string ToString() => regex.ToString();
}

public sealed class TokenDefinition
{
    public readonly IMatcher Matcher;
    public readonly object Token;

    public TokenDefinition(string regex, object token)
    {
        this.Matcher = new RegexMatcher(regex);
        this.Token = token;
    }
}

public sealed class Lexer : IDisposable
{
    private readonly TextReader reader;
    private readonly TokenDefinition[] tokenDefinitions;

    // Unconsumed part of the current line; null once the input is exhausted.
    private string lineRemaining;

    public Lexer(TextReader reader, TokenDefinition[] tokenDefinitions)
    {
        this.reader = reader;
        this.tokenDefinitions = tokenDefinitions;
        nextLine();
    }

    // Advance to the next non-empty line, keeping LineNumber in step.
    private void nextLine()
    {
        do
        {
            lineRemaining = reader.ReadLine();
            ++LineNumber;
            Position = 0;
        } while (lineRemaining != null && lineRemaining.Length == 0);
    }

    // Read the next token. Returns false at the end of the input.
    public bool Next()
    {
        if (lineRemaining == null)
            return false;

        // The first definition that matches wins, so order matters.
        foreach (var def in tokenDefinitions)
        {
            var matched = def.Matcher.Match(lineRemaining);
            if (matched > 0)
            {
                Position += matched;
                Token = def.Token;
                TokenContents = lineRemaining.Substring(0, matched);
                lineRemaining = lineRemaining.Substring(matched);
                if (lineRemaining.Length == 0)
                    nextLine();

                return true;
            }
        }
        throw new Exception(string.Format("Unable to match against any tokens at line {0} position {1} \"{2}\"",
                                          LineNumber, Position, lineRemaining));
    }

    public string TokenContents { get; private set; }
    public object Token { get; private set; }
    public int LineNumber { get; private set; }
    public int Position { get; private set; }

    public void Dispose() => reader.Dispose();
}
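
One detail to watch when a pattern is anchored by prepending `^`: if the pattern contains a top-level alternation, the anchor binds only to the first alternative unless the whole pattern is wrapped in a group. The sketch below illustrates the difference; `AnchorDemo`, `MatchLength`, and the `true|false` pattern are hypothetical names and values chosen for the illustration, not part of the answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Length of the first match of the pattern in text, or 0 when there is none.
    public static int MatchLength(string anchoredPattern, string text)
    {
        Match m = new Regex(anchoredPattern).Match(text);
        return m.Success ? m.Length : 0;
    }

    public static void Main()
    {
        string pattern = "true|false";   // illustrative token pattern
        string text = "x = false";

        // "^true|false": ^ applies only to "true", so "false" is found at index 4
        // and a lexer would wrongly consume 5 characters from the start of the text.
        Console.WriteLine(MatchLength("^" + pattern, text));          // 5

        // "^(?:true|false)": both alternatives must start at index 0, so no match.
        Console.WriteLine(MatchLength("^(?:" + pattern + ")", text)); // 0
    }
}
```

A non-capturing group is used so that backreferences such as `\1` inside the wrapped pattern keep their numbering.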

Sample program:

string sample = @"( one (two 456 -43.2 "" \"" quoted"" ))";

var defs = new TokenDefinition[]
{
    // Thanks to Steven Levithan for this great quoted string regex
    new TokenDefinition(@"([""'])(?:\\\1|.)*?\1", "QUOTED-STRING"),
    // Thanks to http://www.regular-expressions.info/floatingpoint.html
    new TokenDefinition(@"[-+]?\d*\.\d+([eE][-+]?\d+)?", "FLOAT"),
    new TokenDefinition(@"[-+]?\d+", "INT"),
    new TokenDefinition(@"#t", "TRUE"),
    new TokenDefinition(@"#f", "FALSE"),
    new TokenDefinition(@"[*<>\?\-+/A-Za-z->!]+", "SYMBOL"),
    new TokenDefinition(@"\.", "DOT"),
    new TokenDefinition(@"\(", "LEFT"),
    new TokenDefinition(@"\)", "RIGHT"),
    new TokenDefinition(@"\s", "SPACE")
};

TextReader r = new StringReader(sample);
Lexer l = new Lexer(r, defs);
while (l.Next())
    Console.WriteLine("Token: {0} Contents: {1}", l.Token, l.TokenContents);

Output:

Token: LEFT Contents: (
Token: SPACE Contents:
Token: SYMBOL Contents: one
Token: SPACE Contents:
Token: LEFT Contents: (
Token: SYMBOL Contents: two
Token: SPACE Contents:
Token: INT Contents: 456
Token: SPACE Contents:
Token: FLOAT Contents: -43.2
Token: SPACE Contents:
Token: QUOTED-STRING Contents: " \" quoted"
Token: SPACE Contents:
Token: RIGHT Contents: )
Token: RIGHT Contents: )
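
Note that the lexer above returns the first definition that matches, not the longest match, so the order of the definitions matters: FLOAT has to come before INT for "-43.2" to be read as one token. The following standalone sketch shows the effect of swapping them. `OrderDemo` and `Tokenize` are hypothetical names, and the simplified FLOAT pattern omits the exponent part of the one used in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class OrderDemo
{
    // Tokenizes a single line by trying the rules in order and taking the first
    // rule that matches at the current position (first match, not longest match).
    // Each row of rules is { pattern, token name }.
    public static List<string> Tokenize(string line, string[,] rules)
    {
        var tokens = new List<string>();
        int position = 0;
        while (position < line.Length)
        {
            bool matched = false;
            for (int i = 0; i < rules.GetLength(0) && !matched; i++)
            {
                Match m = Regex.Match(line.Substring(position), "^(?:" + rules[i, 0] + ")");
                if (m.Success && m.Length > 0)
                {
                    tokens.Add(rules[i, 1] + ":" + m.Value);
                    position += m.Length;
                    matched = true;
                }
            }
            if (!matched)
                throw new FormatException("No rule matches at position " + position);
        }
        return tokens;
    }

    public static void Main()
    {
        string[,] floatFirst = { { @"[-+]?\d*\.\d+", "FLOAT" }, { @"[-+]?\d+", "INT" } };
        string[,] intFirst = { { @"[-+]?\d+", "INT" }, { @"[-+]?\d*\.\d+", "FLOAT" } };

        Console.WriteLine(string.Join(" ", Tokenize("-43.2", floatFirst))); // FLOAT:-43.2
        Console.WriteLine(string.Join(" ", Tokenize("-43.2", intFirst)));   // INT:-43 FLOAT:.2
    }
}
```

Listing the more specific pattern first is the simplest way to keep this behavior predictable without implementing longest-match selection.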

Regarding c# - Poor man's "lexer" for C#, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/673113/
