
Regex to capture VBA comments

Reposted. Author: 行者123. Updated: 2023-12-05 01:04:59

I am trying to capture VBA comments. So far I have the following:

'[^";]+\Z

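It captures anything that starts with a single quote and contains no double quotes before the end of the string, i.e. it will not match a single quote inside a double-quoted string.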
dim s as string        ' a string variable   -- works
s = "the cat's hat" ' quote within string -- works

But it fails if the comment itself contains a double-quoted string.

For example:
dim s as string ' string should be set to "ten"
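
The behaviour described above can be reproduced with a short C# program. This is only a sketch: it assumes the pattern is run through .NET's `Regex` class, and the class name `RegexAttempt` is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexAttempt
{
    static void Main()
    {
        // The pattern from the question: an apostrophe followed by anything
        // except double quotes or semicolons, up to the end of the input.
        var pattern = new Regex(@"'[^"";]+\Z");

        string[] samples =
        {
            "dim s as string        ' a string variable   -- works",
            "s = \"the cat's hat\" ' quote within string -- works",
            "dim s as string ' string should be set to \"ten\"",
        };

        foreach (var line in samples)
        {
            var match = pattern.Match(line);
            // Print the captured comment, or a marker when nothing matched.
            Console.WriteLine(match.Success ? match.Value : "(no match)");
        }
    }
}
```

The first two samples print their comment; the third prints `(no match)`, because the character class refuses to cross the double quotes inside the comment.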

How can I fix my regex to handle this?

Best answer

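The pattern in @Jeff Wurz's comment ( ^\'[^\r\n]+$|''[^\r\n]+$ ) doesn't match any of your test samples, and the linked question isn't useful: the regex there only matches the specific comment in that OP's question, not "VBA comment syntax".

The regex you came up with works better than what I had when I gave up on the regex approach.

Well done!

The problem is that you can't parse VBA comments with a regex.

In Lexers vs Parsers, @SasQ's answer does a good job of explaining the levels of Chomsky's grammar hierarchy: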

Level 3: Regular grammars

They use regular expressions, that is, they can consist only of the symbols of alphabet (a,b), their concatenations (ab,aba,bbb etc.), or alternatives (e.g. a|b). They can be implemented as finite state automata (FSA), like NFA (Nondeterministic Finite Automaton) or better DFA (Deterministic Finite Automaton). Regular grammars can't handle with nested syntax, e.g. properly nested/matched parentheses (()()(()())), nested HTML/BBcode tags, nested blocks etc. It's because state automata to deal with it should have to have infinitely many states to handle infinitely many nesting levels.

Level 2: Context-free grammars

They can have nested, recursive, self-similar branches in their syntax trees, so they can handle with nested structures well. They can be implemented as state automaton with stack. This stack is used to represent the nesting level of the syntax. In practice, they're usually implemented as a top-down, recursive-descent parser which uses machine's procedure call stack to track the nesting level, and use recursively called procedures/functions for every non-terminal symbol in their syntax. But they can't handle with a context-sensitive syntax. E.g. when you have an expression x+3 and in one context this x could be a name of a variable, and in other context it could be a name of a function etc.

Level 1: Context-sensitive grammars



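Regex is simply not the right tool for this problem. Whenever a line contains more than one single quote (apostrophe), or any double quotes, you need to work out whether the leftmost apostrophe is inside double quotes. If it is, you need to match the double quotes and find the leftmost apostrophe after the closing double quote. In effect, the leftmost apostrophe that is not part of a string literal is your comment marker.

My understanding is that VBA comment syntax is a context-sensitive grammar (level 1), because an apostrophe is only your marker if it isn't part of a string literal. To work out whether an apostrophe is part of a string literal, the simplest approach is probably to walk the string left to right and toggle an IsInsideQuote flag whenever you encounter a double quote... but only if it isn't escaped (doubled). In fact, you don't even check whether a string literal contains an apostrophe: you just keep walking until the open quote is closed, and you have found a comment marker only if the "inside quotes" flag is False when you hit a single quote.

Good luck!

Here is a test case you're missing: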
s = "abc'def ""xyz""'nutz!" 'string with apostrophes and escaped double quotes

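If you don't care about capturing the string literals themselves, you can simply ignore the escaped double quotes and see 3 string literals here: "abc'def ", "xyz" and "'nutz!".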
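
To make the "3 string literals" reading concrete, here is a minimal sketch of a scanner that treats every double quote as a plain open/close toggle and prints each quoted segment it sees. The names `QuoteSegments` and `QuotedSegments` are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class QuoteSegments
{
    // Returns the quoted segments seen by a scanner that toggles on every
    // double quote and does not treat "" as an escape sequence.
    static List<string> QuotedSegments(string line)
    {
        var segments = new List<string>();
        var start = -1; // index of the opening quote, or -1 when outside quotes
        for (var i = 0; i < line.Length; i++)
        {
            if (line[i] != '"') continue;
            if (start < 0)
            {
                start = i; // opening quote
            }
            else
            {
                // Closing quote: record the segment including both quotes.
                segments.Add(line.Substring(start, i - start + 1));
                start = -1;
            }
        }
        return segments;
    }

    static void Main()
    {
        var line = "s = \"abc'def \"\"xyz\"\"'nutz!\" 'string with apostrophes and escaped double quotes";
        foreach (var segment in QuotedSegments(line))
        {
            Console.WriteLine(segment);
        }
    }
}
```

For the test case above this prints three segments: `"abc'def "`, `"xyz"` and `"'nutz!"`. That split is wrong as a reading of the actual string value, but it is good enough to tell which apostrophes sit outside any literal.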

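This C# code outputs 'string with apostrophes and escaped double quotes (all double quotes inside the C# string are escaped with backslashes in the code), and it works with every test string I gave it: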

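    using System;

    static class Program
    {
        static void Main(string[] args)
        {
            var instruction = "s = \"abc'def \"\"xyz\"\"'nutz!\" 'string with apostrophes and escaped double quotes";
            // var instruction = "s = \"the cat's hat\" ' quote within string -- works";
            // var instruction = "dim s as string ' string should be set to \"ten\"";

            // Index of the apostrophe that starts the comment, if any.
            int? commentStart = null;

            var isInsideQuotes = false;
            for (var i = 0; i < instruction.Length; i++)
            {
                // Every double quote toggles the flag; an escaped (doubled)
                // quote toggles it twice, so the state is unchanged afterwards.
                if (instruction[i] == '"')
                {
                    isInsideQuotes = !isInsideQuotes;
                }

                // The first apostrophe outside a string literal is the comment marker.
                if (!isInsideQuotes && instruction[i] == '\'')
                {
                    commentStart = i;
                    break;
                }
            }

            if (commentStart.HasValue)
            {
                Console.WriteLine(instruction.Substring(commentStart.Value));
            }

            // Keep the console window open until Enter is pressed.
            Console.ReadLine();
        }
    }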

Then, if you want to capture all legal comments, you need to handle the legacy Rem keyword and account for line continuations:
Rem this is a legal comment
' this _
is also _
a legal comment

In other words, \r\n alone is not enough to correctly identify every end-of-statement marker.
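
As a rough illustration of what those two extra rules involve, the sketch below first joins physical lines that end in a space followed by an underscore into logical lines, then looks for either an apostrophe outside a string literal or a Rem keyword at the start of a statement. All names here (`VbaCommentSketch`, `LogicalLines`, `CommentStart`, `IsRemAt`) are made up for this example. It assumes that a continuation is always written as a space plus underscore at the end of a line and that statements on one line are separated by colons. It ignores line numbers and other corner cases, so it is not a replacement for a real lexer.

```csharp
using System;
using System.Collections.Generic;

static class VbaCommentSketch
{
    // Joins physical lines ending in " _" into a single logical line.
    static IEnumerable<string> LogicalLines(string source)
    {
        var pending = "";
        foreach (var raw in source.Split('\n'))
        {
            var line = raw.TrimEnd('\r');
            var trimmed = line.TrimEnd();
            if (trimmed.EndsWith(" _", StringComparison.Ordinal))
            {
                // Drop the underscore and keep reading the next physical line.
                pending += trimmed.Substring(0, trimmed.Length - 1);
                continue;
            }
            yield return pending + line;
            pending = "";
        }
        if (pending.Length > 0) yield return pending;
    }

    // True when "Rem" (any casing) starts at index i and is followed by
    // whitespace or the end of the line.
    static bool IsRemAt(string line, int i)
    {
        if (string.Compare(line, i, "Rem", 0, 3, StringComparison.OrdinalIgnoreCase) != 0)
        {
            return false;
        }
        var end = i + 3;
        return end == line.Length || char.IsWhiteSpace(line[end]);
    }

    // Returns the index where the comment starts, or -1 when there is none.
    static int CommentStart(string line)
    {
        var insideQuotes = false;
        var atStatementStart = true;
        for (var i = 0; i < line.Length; i++)
        {
            var c = line[i];
            if (c == '"')
            {
                insideQuotes = !insideQuotes;
                atStatementStart = false;
                continue;
            }
            if (insideQuotes) continue;
            if (c == '\'') return i;
            if (atStatementStart && IsRemAt(line, i)) return i;
            if (c == ':') atStatementStart = true;
            else if (!char.IsWhiteSpace(c)) atStatementStart = false;
        }
        return -1;
    }

    static void Main()
    {
        var source = "Rem this is a legal comment\r\n' this _\r\nis also _\r\na legal comment\r\n"
                   + "s = \"the cat's hat\" ' quote within string";
        foreach (var line in LogicalLines(source))
        {
            var start = CommentStart(line);
            if (start >= 0) Console.WriteLine(line.Substring(start));
        }
    }
}
```

Run on the sample lines, this should print the Rem comment, the three continued lines joined into one comment, and the trailing comment after the string literal. Every additional rule (line numbers, conditional compilation, and so on) adds another special case to the scanner.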

A proper lexer + parser seems to be the only way to capture all comments.

Regarding regex to capture VBA comments, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/22044801/
