
antlr - Parsing fixed-width input with ANTLR4

Reposted. Author: 行者123. Updated: 2023-12-04 17:53:54

I have a weird input format:

ACOMAND          1.0       1.0
ACOMAND
ACOMAND 1.0
ACOMAND 1.0 1.0 1300.2 .9 1.0
ACOMAND 1.0 1.0 1300.2 .9
ACOMAND OKK 1.0 1300.2 .9 1.0 WOW
ACOMAND 1.0 1.0 1300.2

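Each command is on its own line, and missing or blank columns are implicitly zero. Basically, the first string is left-aligned, and all other strings are right-aligned to columns 20, 30, 40, ..., 80. The first column is always an ID. Every other column is either an ID or a float. Empty columns (filled with spaces, or absent altogether) are implicitly zero.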

How do I parse this?

I thought of:

grammar WeirdGrammar;
comm: KEYWORD NEWLINE
| KEYWORD COLUMN NEWLINE
| KEYWORD COLUMN COLUMN NEWLINE
| KEYWORD COLUMN COLUMN COLUMN NEWLINE
| KEYWORD COLUMN COLUMN COLUMN COLUMN NEWLINE
| KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE
| KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE
| KEYWORD COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN COLUMN NEWLINE
;

KEYWORD: [A-Z] {getCharPositionInLine() == 1}? ([A-Z]|'-')* WS*? {getCharPositionInLine() == 10}? ;
COLUMN: .+? {(getCharPositionInLine() % 10) == 0}? ;
NEWLINE : '\r'? '\n' ;
WS : [ \t] ;

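Basically, the idea is to handle every combination, from a lone KEYWORD up to a KEYWORD followed by 7 COLUMNs. The width of a COLUMN is limited to 10, which is enforced by matching anything non-greedily until the char position modulo 10 is zero. The keyword should start at the beginning of the line, hence the first predicate in that token's rule, and it should not extend past column 10, hence the second predicate. However, this currently does not work and instead returns: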

line 1:0 mismatched input 'ACOMAND          1' expecting KEYWORD

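Even in my naive implementation, this still would not handle trailing whitespace, but I think requiring that lines have no trailing whitespace would not be a problem.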

Best answer

1) With ANTLR 4.6 and the given grammar and input, I get the following message:

line 3:0 no viable alternative at input 'ACOMAND    1.0    1.0\nACOMAND\nACOMAND  '

When debugging a grammar, it is very useful to list the tokens as seen by the lexer:

$ echo $CLASSPATH
.:/usr/local/lib/antlr-4.6-complete.jar
$ alias grun
alias grun='java org.antlr.v4.gui.TestRig'
$ grun Question question -tokens data.txt
[@0,0:9='ACOMAND   ',<KEYWORD>,1:0]
[@1,10:19='       1.0',<COLUMN>,1:10]
[@2,20:29='       1.0',<COLUMN>,1:20]
[@3,30:30='\n',<COLUMN>,1:30]
[@4,31:38='ACOMAND\n',<COLUMN>,2:0]

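Before 4.6, the token was displayed as [@3,30:30='\n',<n>,1:30], and you had to look in the file -grammar-.tokens to find which token had number n. Now the number is translated into the token name, which is great: you see immediately that the newline has been recognized as a COLUMN token, not as NEWLINE as you expected. This is because, when several rules match the same amount of input, the lexer takes the rule that comes first in the grammar. In effect it tries the rules in order: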

  1. Does '\n' match [A-Z]? No, so it is not a KEYWORD; go to the next rule.
  2. Does '\n' match .+? ? Yes, so it is a COLUMN, and the NEWLINE rule is never reached.

So you need to put the COLUMN rule after the NEWLINE rule.
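To make the ordering effect concrete, here is a toy model in Ruby. It is only a sketch and not ANTLR's implementation: pick_rule is a hypothetical helper, the two patterns are simplified stand-ins for the grammar's rules, and the semantic predicates are ignored. It assumes the usual lexer behavior that the longest match wins and that a tie goes to the rule listed first.

```ruby
# Toy model of lexer rule selection (an illustration, NOT ANTLR's code):
# the longest match wins, and a tie goes to the rule listed first.
def pick_rule(rules, input)
  best_name = nil
  best_len = 0
  rules.each do |name, pattern|
    m = pattern.match(input)
    next if m.nil?
    len = m[0].length
    # Strict '>' keeps the earlier rule when two rules match the same length.
    if len > best_len
      best_name = name
      best_len = len
    end
  end
  best_name
end

# Simplified stand-ins for the grammar's rules (predicates are left out).
COLUMN  = [:COLUMN,  /\A./m]       # any single character, "\n" included
NEWLINE = [:NEWLINE, /\A\r?\n/]

p pick_rule([COLUMN, NEWLINE], "\n")   # COLUMN is listed first  => :COLUMN
p pick_rule([NEWLINE, COLUMN], "\n")   # NEWLINE is listed first => :NEWLINE
```

Both rules match exactly one character here, so only the order in which they are listed decides the token type.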

You will also see that the second line of input has been tokenized as [@4,31:38='ACOMAND\n',<COLUMN>,2:0], because it cannot match

KEYWORD: [A-Z] ... WS*? 

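because the rule expects spaces after the keyword, and here there is only a NL. So replace WS*? with ( WS* | NEWLINE ). The final predicate is also relaxed from == 10 to <= 10 in the grammar below, since a keyword followed directly by a NL ends before column 10.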

Finally, simplify the redundant parser rule:

grammar Question;

question
: KEYWORD COLUMN* NEWLINE
;

KEYWORD : [A-Z] {getCharPositionInLine() == 1}? ([A-Z]|'-')* ( WS* | NEWLINE ) {getCharPositionInLine() <= 10}? ;
NEWLINE : '\r'? '\n' ;
WS : [ \t] ;
COLUMN: .+? {(getCharPositionInLine() % 10) == 0}? ;

Now the lexer gives:

[@0,0:9='ACOMAND   ',<KEYWORD>,1:0]
[@1,10:19='       1.0',<COLUMN>,1:10]
[@2,20:29='       1.0',<COLUMN>,1:20]
[@3,30:30='\n',<NEWLINE>,1:30]
[@4,31:38='ACOMAND\n',<KEYWORD>,2:0]


2) But is all this really useful? Is a parser generator the right tool? Delete one space and see what happens:

line 2:0 extraneous input 'ACOMAND\n' expecting {NEWLINE, COLUMN}

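I think you should let the lexer do a simple job without these position constraints: create a token for each run of non-whitespace data and discard the whitespace. Later, in the parser or in a listener, you can check the positions: every token has attributes such as start, stop, line, and so on.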
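The same two-step idea can be sketched outside ANTLR. The Ruby below is a rough illustration under stated assumptions, not a finished validator: Token, tokenize, and check_positions are names invented here, and the rules (first token starts at column 1 and fits in columns 1-10, every other token ends exactly on column 20, 30, ..., 80) are taken from the question's description.

```ruby
# Sketch: tokenize on whitespace only, then validate positions afterwards.
FIELD_WIDTH = 10   # assumed from the question: fields end at columns 10, 20, ..., 80
MAX_COLUMN  = 80

Token = Struct.new(:text, :start, :stop)   # start/stop are 1-based columns

# Step 1: one Token per run of non-whitespace characters, no position rules.
def tokenize(line)
  text = line.chomp
  tokens = []
  pos = 0
  while (m = /\S+/.match(text, pos))
    tokens << Token.new(m[0], m.begin(0) + 1, m.end(0))
    pos = m.end(0)
  end
  tokens
end

# Step 2: return a list of alignment problems (empty when the line is fine).
def check_positions(tokens)
  errors = []
  tokens.each_with_index do |tok, i|
    if i.zero?
      errors << "#{tok.text}: must start at column 1" unless tok.start == 1
      errors << "#{tok.text}: must fit in columns 1-#{FIELD_WIDTH}" if tok.stop > FIELD_WIDTH
    elsif tok.stop % FIELD_WIDTH != 0 || tok.stop < 2 * FIELD_WIDTH || tok.stop > MAX_COLUMN
      errors << "#{tok.text}: ends at column #{tok.stop}, not on a field boundary"
    end
  end
  errors
end

good = "ACOMAND          1.0       1.0"   # fields end at columns 20 and 30
bad  = "ACOMAND          1.0      1.0"    # one space removed before the last field
p check_positions(tokenize(good))   # no problems reported
p check_positions(tokenize(bad))    # the last field ends at column 29
```

Because the position check runs after tokenizing, a misplaced field produces a precise message instead of a lexer failure on the whole line.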

Why not a Ruby script? :-)

# Split 80 columns lines into 10 columns wide tokens, associate each token
# with its stop position in line (counting from 1) and an OK/WRONG flag
# if it is not aligned correctly.

tokens = []

IO.readlines("data.txt").each_with_index do |line, i|
  # The first line of data.txt is a column ruler: echo it and skip it.
  if i == 0
    puts " #{line}"
    next
  end

  line_tokens = []
  line = line.chomp # remove NL
  print "line #{i + 1} : "
  8.times do |n| # n = 0 to 7
    a = n * 10     # begin of split range, counting from 0
    b = n * 10 + 9 # end of range
    token = line.slice(a..b)
    next if token.nil? || token.length == 0 # nil if edge case
    print token
    good_position = 'OK'
    position = b + 1

    case n
    when 0 # first token must be at column 1
      good_position = 'WRONG' if token[0] == ' '
    else   # other tokens must be right-aligned in their 10-column field
      # A trailing space means the value is not right-aligned. An all-blank
      # field is an empty column and is allowed (the output below reports
      # empty columns as OK).
      if token[-1] == ' ' && token != ' ' * 10
        good_position = 'WRONG'
        unless (pos = token.rindex(' ')).nil?
          position = position - 10 + pos - 1
        end
      end
      if token.length != 10 # last in line, shorter than a full field
        good_position = 'WRONG'
        position = position - 10 + token.length
      end
    end

    line_tokens << [token.strip, position, good_position]
    break if b > line.length
  end
  puts # print a NL because print doesn't do it
  tokens << line_tokens
end

puts
puts "Lists of tokens : "
p tokens

Input data.txt:

....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8
ACOMAND 1.0 1.0
ACOMAND
ACOMAND 1.0
ACOMAND 1.0 1.0 1300.2 .9 1.0
ACOMAND 1.0 1.0 1300.2 .9
ACOMAND OKK 1.0 1300.2 .9 1.0 WOW
ACOMAND 1.0 1.0 1300.2

Output:

$ ruby -w split.rb 
....+....1....+....2....+....3....+....4....+....5....+....6....+....7....+....8
line 2 : ACOMAND 1.0 1.0
line 3 : ACOMAND
line 4 : ACOMAND 1.0
line 5 : ACOMAND 1.0 1.0 1300.2 .9 1.0
line 6 : ACOMAND 1.0 1.0 1300.2 .9
line 7 : ACOMAND OKK 1.0 1300.2 .9 1.0 WOW
line 8 : ACOMAND 1.0 1.0 1300.2

Lists of tokens :
[[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 29, "WRONG"]],
[["ACOMAND", 10, "OK"]], [["ACOMAND", 10, "OK"], ["1.0", 20, "OK"]],
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2",
40, "OK"], ["", 50, "OK"], [".9", 58, "WRONG"], ["1.0", 68, "WRONG"]],
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2",
40, "OK"], ["", 50, "OK"], [".9", 60, "OK"]], [["ACOMAND", 10, "OK"],
["OKK", 20, "OK"], ["1.0", 30, "OK"], ["1300.2", 40, "OK"], ["", 50,
"OK"], [".9", 60, "OK"], ["1.0", 70, "OK"], ["WOW", 80, "OK"]],
[["ACOMAND", 10, "OK"], ["1.0", 20, "OK"], ["1.0", 30, "OK"], ["1300.2",
40, "OK"]]]
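The script above only reports alignment. To go one step further and obtain the values themselves, with blank or missing columns read as zero as the question requires, a minimal sketch could look like the following. The names parse_fixed_width, WIDTH, COLUMNS, and NUMBER are invented for this illustration, and the sketch assumes the line is already correctly aligned: it does no position checking, so a misaligned value that straddles a field boundary would be split in two.

```ruby
# Sketch: slice one line into fixed-width fields and apply "blank means zero".
WIDTH   = 10   # assumed field width
COLUMNS = 8    # assumed number of fields in an 80-column line

# Accepts forms such as 1.0, .9, 1300.2, -3, 2e5 (an assumption about the floats).
NUMBER = /\A[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?\z/

# Returns [id, v1, ..., v7]; each v is a Float, or a String for a non-numeric ID.
def parse_fixed_width(line)
  padded = line.chomp.ljust(WIDTH * COLUMNS)   # short line: missing columns become blank
  fields = Array.new(COLUMNS) { |n| padded[n * WIDTH, WIDTH].strip }
  id, *rest = fields
  values = rest.map do |f|
    if f.empty?
      0.0                 # empty column is implicitly zero
    elsif f.match?(NUMBER)
      f.to_f              # numeric column
    else
      f                   # anything else is kept as an ID string
    end
  end
  [id, *values]
end

p parse_fixed_width("ACOMAND          1.0       1.0")
# => ["ACOMAND", 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Padding the line to 80 characters first means a short line and a line with explicit blank columns go through the same code path.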

Regarding antlr - Parsing fixed-width input with ANTLR4, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42101184/
