c - What is the best way to write a grammar tokenizer/parser in C?

Reposted. Author: 行者123. Updated: 2023-12-03 18:52:16
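Closed. This question needs to be more focused. It is not currently accepting answers.

Want to improve this question? Update the question so that it focuses on one problem only by editing this post.

Closed 4 years ago.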




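Background:
I am eager to create a programming language and I know of the tools for doing so, but I don't have any good examples of how to use them. I really don't want to use Flex or Bison, because they don't teach the level of abstraction that I believe is needed to create a compiler. I have the general concept of creating strings, tokenizing them, feeding them into a file that acts as the grammar, and parsing, eventually producing an actual program that runs the language. The problem is that I don't know how to write a tokenizer or a parser. I have a general idea, but I understand things better when I see examples. If someone could post one or a few examples, that would be great!

My question is as follows:
Could someone post an example of how to write a grammar tokenizer/parser in C?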

Best answer

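If you want to write a parser for a fairly complex grammar in C without using any existing pattern-matching code, it is usually best to implement a state machine and process the source character by character.
The output of Flex+Bison is also just a state machine. Flex uses regular expressions to split a string into tokens, and those tokens are then passed to the Bison state machine, which processes them one after another depending on the current state of the machine. But you don't need a regex tokenizer; you can tokenize the input as part of the state machine processing. A regex matcher can itself be implemented as a state machine, so token generation can be made a direct part of your state machine.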
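To illustrate the point that a regex matcher can be written as a state machine, here is a minimal sketch that recognizes the pattern [0-9]+(\.[0-9]+)? one character at a time. The names NumState, numStep and matchesNumber are made up for this illustration and are not part of the answer's code.

```c
#include <stdbool.h>
#include <stddef.h>

/* States of a tiny recognizer for the pattern [0-9]+(\.[0-9]+)? */
typedef enum {
    NUM_Start,       /* nothing consumed yet                  */
    NUM_IntDigits,   /* one or more integer digits seen       */
    NUM_Dot,         /* "digits." seen, a fraction must follow */
    NUM_FracDigits,  /* at least one fraction digit seen      */
    NUM_Reject       /* the input can no longer match         */
} NumState;

/* Advance the machine by exactly one input character. */
static NumState numStep(NumState s, char c) {
    bool isDigit = (c >= '0' && c <= '9');
    switch (s) {
        case NUM_Start:
            return isDigit ? NUM_IntDigits : NUM_Reject;
        case NUM_IntDigits:
            if (isDigit) return NUM_IntDigits;
            return (c == '.') ? NUM_Dot : NUM_Reject;
        case NUM_Dot:
            return isDigit ? NUM_FracDigits : NUM_Reject;
        case NUM_FracDigits:
            return isDigit ? NUM_FracDigits : NUM_Reject;
        case NUM_Reject:
            break;
    }
    return NUM_Reject;
}

/* True if the whole string matches the pattern. */
static bool matchesNumber(const char *text) {
    NumState s = NUM_Start;
    for (size_t i = 0; text[i] != '\0'; i++) {
        s = numStep(s, text[i]);
    }
    /* Only these two states are accepting states. */
    return s == NUM_IntDigits || s == NUM_FracDigits;
}
```

Because numStep consumes a single character and returns the next state, the same transitions could be folded into a larger parser state machine instead of running as a separate tokenizer pass.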
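Here is an interesting link; it is not specific to C but rather a general overview of how state machines work. Once you have grasped the concept, it is easy to translate it into C code:
Parsing command line arguments using a finite state machine and backtracking
Here is some sample code for a super primitive CSV parser: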

#include <stdlib.h>
#include <stdio.h>

static char currentToken[4096];
static size_t currentTokenLength;

// Append one character to the token buffer.
// Characters beyond the buffer size are silently dropped.
static
void addCharToCurrentToken ( char c ) {
    if (currentTokenLength < sizeof(currentToken)) {
        currentToken[currentTokenLength++] = c;
    }
}

// Print the collected token and reset the buffer for the next one.
static
void printCurrentToken ( void ) {
    printf("Token: >>>%.*s<<<\n", (int)currentTokenLength, currentToken);
    currentTokenLength = 0;
}


typedef enum {
    STATE_FindStartOfData,
    STATE_FindStartOfToken,
    STATE_ParseNumber,
    STATE_ParseString,
    STATE_CheckEndOfString,
    STATE_FindDelimiter,
    STATE_ParseError,
    STATE_EndOfData
} ParserState;


ParserState parserState = STATE_FindStartOfData;


static
void runTheStateMachine ( void ) {
    while (parserState != STATE_ParseError
        && parserState != STATE_EndOfData
    ) {
        int c = fgetc(stdin);
        // End of data? (fgetc returns EOF, not necessarily -1)
        if (c == EOF) {
            switch (parserState) {
                case STATE_ParseNumber:
                case STATE_CheckEndOfString:
                    printCurrentToken();
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseString:
                    // Data ends in the middle of token parsing? No way!
                    fprintf(stderr, "Data ended abruptly!\n");
                    parserState = STATE_ParseError;
                    break;

                case STATE_FindStartOfData:
                case STATE_FindStartOfToken:
                case STATE_FindDelimiter:
                    // This is okay, data stream may end while in these states
                    parserState = STATE_EndOfData;
                    break;

                case STATE_ParseError:
                case STATE_EndOfData:
                    break;
            }
        }

        switch (parserState) {
            case STATE_FindStartOfData:
                // Skip blank lines
                if (c == '\n' || c == '\r') break;
                // !!!FALLTHROUGH!!!

            case STATE_FindStartOfToken:
                // Skip over all whitespace
                if (c == ' ' || c == '\t') break;
                // Start of string?
                if (c == '"') {
                    parserState = STATE_ParseString;
                    break;
                }
                // Blank field?
                if (c == ',') {
                    printCurrentToken();
                    break;
                }
                // End of dataset?
                if (c == '\n' || c == '\r') {
                    printf("------------------------------------------\n");
                    parserState = STATE_FindStartOfData;
                    break;
                }
                // Everything else can only be a number
                parserState = STATE_ParseNumber;
                addCharToCurrentToken(c);
                break;

            case STATE_ParseNumber:
                if (c == ' ' || c == '\t') {
                    // Numbers cannot contain spaces in the middle,
                    // so this must be the end of the number.
                    printCurrentToken();
                    // We still need to find the real delimiter, though.
                    parserState = STATE_FindDelimiter;
                    break;
                }
                if (c == ',') {
                    // This time the number ends directly with a delimiter
                    printCurrentToken();
                    parserState = STATE_FindStartOfToken;
                    break;
                }
                // End of dataset?
                if (c == '\n' || c == '\r') {
                    printCurrentToken();
                    printf("------------------------------------------\n");
                    parserState = STATE_FindStartOfData;
                    break;
                }
                // Otherwise keep reading the number
                addCharToCurrentToken(c);
                break;

            case STATE_ParseString:
                if (c == '"') {
                    // Either this is the regular end of the string or it is just an
                    // escaped quotation mark which is doubled ("") in CSV.
                    parserState = STATE_CheckEndOfString;
                    break;
                }
                // All other chars are just treated as ordinary chars
                addCharToCurrentToken(c);
                break;

            case STATE_CheckEndOfString:
                if (c == '"') {
                    // Next char is also a quotation mark,
                    // so this was not the end of the string.
                    addCharToCurrentToken(c);
                    parserState = STATE_ParseString;
                    break;
                }
                if (c == ' ' || c == '\t') {
                    // It was the end of the string
                    printCurrentToken();
                    // We still need to find the real delimiter, though.
                    parserState = STATE_FindDelimiter;
                    break;
                }
                if (c == ',') {
                    // It was the end of the string
                    printCurrentToken();
                    // And we even found the delimiter
                    parserState = STATE_FindStartOfToken;
                    break;
                }
                if (c == '\n' || c == '\r') {
                    // It was the end of the string
                    printCurrentToken();
                    // And we even found the end of this dataset
                    printf("------------------------------------------\n");
                    parserState = STATE_FindStartOfData;
                    break;
                }
                // Everything else is a parse error I guess
                fprintf(stderr, "Unexpected char 0x%02X after end of string!\n", c);
                parserState = STATE_ParseError;
                break;

            case STATE_FindDelimiter:
                // Delimiter found?
                if (c == ',') {
                    parserState = STATE_FindStartOfToken;
                    break;
                }
                // Just skip over all whitespace
                if (c == ' ' || c == '\t') break;
                // End of dataset?
                if (c == '\n' || c == '\r') {
                    // And we even found the end of this dataset
                    printf("------------------------------------------\n");
                    parserState = STATE_FindStartOfData;
                    break;
                }
                // Anything else is a parse error I guess
                fprintf(stderr, "Unexpected char 0x%02X after end of token!\n", c);
                parserState = STATE_ParseError;
                break;

            case STATE_ParseError:
                // Nothing to do
                break;

            case STATE_EndOfData:
                // Nothing to do
                break;
        }
    }
}

int main ( void ) {
    runTheStateMachine();
    return (parserState == STATE_EndOfData ? 0 : 1);
}
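The code makes the following assumptions:
  • A token is never longer than 4096 characters.
  • The delimiter is a comma
    (which is what CSV implies, though not all CSV files use commas for that)
  • Strings are always quoted
    (usually this is optional unless they contain spaces or quotation marks)
  • There are no line breaks within quoted strings
    (this is usually allowed)
  • The code assumes that everything unquoted is a number, but it does not verify that the number is correctly formatted.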
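
The last assumption could be tightened by checking each unquoted token before accepting it. The sketch below is one possible approach using the standard strtod function; the helper name isWellFormedNumber is hypothetical and not part of the answer's code. It expects a NUL-terminated string, so the token buffer above would first need a terminating '\0' appended.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: true if text is a complete number that strtod
 * can convert without a range error. On success the value is stored
 * in *valueOut, if valueOut is not NULL. */
static bool isWellFormedNumber(const char *text, double *valueOut) {
    char *end = NULL;
    errno = 0;
    double value = strtod(text, &end);
    if (end == text) return false;      /* nothing was converted       */
    if (*end != '\0') return false;     /* trailing characters remain  */
    if (errno == ERANGE) return false;  /* overflow or underflow       */
    if (valueOut != NULL) *valueOut = value;
    return true;
}
```

Note that strtod is more permissive than most CSV producers: it also accepts leading whitespace, exponents, hexadecimal floats and spellings such as "inf" and "nan". If only plain decimals should pass, a hand-written state machine for the number format is the stricter choice.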

    This code can by no means parse every CSV file you feed it, but when you feed it this file:
    "Year","Brand","Model"   ,"Description",  "Price"
    1997,"Ford", "E350","ac, abs, moon", 3000.00
    1999,"Chevy","Venture ""Extended Edition""",,4900.00
    1999,"Chevy", "Venture ""Extended Edition, Very Large""" , , 5000.00
    1996,"Jeep", "Grand Cherokee","MUST SELL!"
    it will produce the following output:
    Token: >>>Year<<<
    Token: >>>Brand<<<
    Token: >>>Model<<<
    Token: >>>Description<<<
    Token: >>>Price<<<
    ------------------------------------------
    Token: >>>1997<<<
    Token: >>>Ford<<<
    Token: >>>E350<<<
    Token: >>>ac, abs, moon<<<
    Token: >>>3000.00<<<
    ------------------------------------------
    Token: >>>1999<<<
    Token: >>>Chevy<<<
    Token: >>>Venture "Extended Edition"<<<
    Token: >>><<<
    Token: >>>4900.00<<<
    ------------------------------------------
    Token: >>>1999<<<
    Token: >>>Chevy<<<
    Token: >>>Venture "Extended Edition, Very Large"<<<
    Token: >>><<<
    Token: >>>5000.00<<<
    ------------------------------------------
    Token: >>>1996<<<
    Token: >>>Jeep<<<
    Token: >>>Grand Cherokee<<<
    Token: >>>MUST SELL!<<<
    ------------------------------------------
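    It is only meant to give you an idea of how to parse a complex grammar with a state machine. This code is far from production quality; as you can see, the switch quickly becomes huge, so I would at least move the code for each state into its own function, or even turn each state into a struct or object for data encapsulation. Otherwise the whole thing quickly becomes unmanageable.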
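
One way to read the suggestion about moving each state into its own function is a table of handler functions indexed by state, with the parser data kept in a struct instead of globals. The following is a minimal sketch under simplifying assumptions (no whitespace skipping, no doubled quotes, short tokens); Parser, StateHandler, feed and the ST_ names are all invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

typedef enum { ST_FieldStart, ST_Unquoted, ST_Quoted, ST_AfterQuote,
               ST_Error, ST_COUNT } State;

/* All parser data lives in one struct instead of global variables. */
typedef struct {
    State  state;
    char   token[64];
    size_t tokenLength;
    char   lastToken[64];  /* most recently completed token */
    int    tokensSeen;
} Parser;

/* Each state is a function: it consumes one char, returns the next state. */
typedef State (*StateHandler)(Parser *p, int c);

static void appendChar(Parser *p, int c) {
    if (p->tokenLength < sizeof(p->token) - 1) {
        p->token[p->tokenLength++] = (char)c;
    }
}

static void finishToken(Parser *p) {
    p->token[p->tokenLength] = '\0';
    strcpy(p->lastToken, p->token);
    p->tokensSeen++;
    p->tokenLength = 0;
}

static State onFieldStart(Parser *p, int c) {
    if (c == '"')  return ST_Quoted;
    if (c == ',')  { finishToken(p); return ST_FieldStart; } /* blank field */
    if (c == '\n') return ST_FieldStart;
    appendChar(p, c);
    return ST_Unquoted;
}

static State onUnquoted(Parser *p, int c) {
    if (c == ',' || c == '\n') { finishToken(p); return ST_FieldStart; }
    appendChar(p, c);
    return ST_Unquoted;
}

static State onQuoted(Parser *p, int c) {
    if (c == '"') { finishToken(p); return ST_AfterQuote; }
    appendChar(p, c);
    return ST_Quoted;
}

static State onAfterQuote(Parser *p, int c) {
    (void)p;
    return (c == ',' || c == '\n') ? ST_FieldStart : ST_Error;
}

static State onError(Parser *p, int c) {
    (void)p; (void)c;
    return ST_Error;
}

/* The dispatch table replaces the big switch statement. */
static const StateHandler handlers[ST_COUNT] = {
    [ST_FieldStart] = onFieldStart,
    [ST_Unquoted]   = onUnquoted,
    [ST_Quoted]     = onQuoted,
    [ST_AfterQuote] = onAfterQuote,
    [ST_Error]      = onError,
};

/* Run the machine over a NUL-terminated input string. */
static void feed(Parser *p, const char *input) {
    for (size_t i = 0; input[i] != '\0'; i++) {
        p->state = handlers[p->state](p, (unsigned char)input[i]);
    }
    if (p->state == ST_Unquoted) {  /* input ended inside a token */
        finishToken(p);
        p->state = ST_FieldStart;
    }
}
```

A zero-initialized Parser starts in ST_FieldStart, so a caller could declare Parser p = {0}; and then call feed(&p, line). Adding a state then means adding one enum value, one function and one table entry, rather than growing a single switch.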

    Regarding "c - What is the best way to write a grammar tokenizer/parser in C?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/40539418/
