
performance - How to improve lexing efficiency?


The DCG has to parse a large 3 GB file, so efficiency matters.

The current version of my lexer mostly uses the disjunction operator ;/2, but I have read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
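To make the quoted behaviour concrete, here is a minimal, made-up sketch (not taken from the question): when the first argument of every clause head is a distinct atom, a call with that argument bound is hashed straight to the matching clause and leaves no choice point behind.

% Hypothetical facts: the first argument is an atom, so SWI-Prolog builds
% a just-in-time index over it on first use.
month_days(jan, 31).
month_days(feb, 28).
month_days(mar, 31).

% ?- month_days(feb, D).
% D = 28.
% The call commits directly to the matching clause instead of trying
% jan, feb, mar in turn and leaving backtrack points.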



Can someone give an example of using indexing for lexing and, ideally, explain how it improves efficiency?

Details

Note: I changed some names before putting the source code into this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this:
% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, the values written with "" will not be matched.

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (  (  ":",  !, { Token = tokColon }
      ;  "(",  !, { Token = tokLParen }
      ;  ")",  !, { Token = tokRParen }
      ;  "{",  !, { Token = tokLMusta}
      ;  "}",  !, { Token = tokRMusta}
      ;  "\\", !, { Token = tokSlash}
      ;  "->", !, { Token = tokImpl}
      ;  "+",  !, { Token = tokPlus }
      ;  "-",  !, { Token = tokMinus }
      ;  "*",  !, { Token = tokTimes }
      ;  "=",  !, { Token = tokEqual }
      ;  "<",  !, { Token = tokLt }
      ;  ">",  !, { Token = tokGt }
      ;  "_",  !, { Token = tokUnderscore }
      ;  ".",  !, { Token = tokPeriod }
      ;  "/",  !, { Token = tokForwardSlash }
      ;  ",",  !, { Token = tokComma }
      ;  ";",  !, { Token = tokSemicolon }
      ;  digit(D), !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
      { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
   "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
   { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
   { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

% N.B. The two plain (non-DCG) alphanum/1 clauses below are never called by
% lexer//1, which only uses the alphanum//1 DCG rules above; they appear to be
% leftovers from an earlier version.
alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).

identifier(L, Id) -->
   alphanum(As),
   { atom_codes(Id, [L|As]) }.
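
For illustration, a hypothetical query (the input line is made up) run through the lex_line/2 helper defined further below should give something like:

?- lex_line("x -> y + 1.", Tokens).
Tokens = [tokVar(x), tokImpl, tokVar(y), tokPlus, tokNumber(1), tokPeriod].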

Below are some helper predicates used for development and testing.
read_file_for_lexing_and_user_review(Path) :-
   open(Path,read,Input),
   read_input_for_user_review(Input), !,
   close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
   open(Path,read,Input),
   read_input_for_performance(Input,0,Limit), !,
   close(Input).

read_input(Input) :-
   at_end_of_stream(Input).

read_input(Input) :-
   \+ at_end_of_stream(Input),
   read_string(Input, "\n", "\r\t ", _, Line),
   lex_line(Line),
   read_input(Input).

read_input_for_user_review(Input) :-
   at_end_of_stream(Input).

read_input_for_user_review(Input) :-
   \+ at_end_of_stream(Input),
   read_string(Input, "\n", "\r\t ", _, Line),
   lex_line_for_user_review(Line),
   nl,
   print('Press spacebar to continue or any other key to exit: '),
   get_single_char(Key),
   process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(Input,Count,Limit) :-
   Count >= Limit.

read_input_for_performance(Input,_,_) :-
   at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
   % print(Count0),
   \+ at_end_of_stream(Input),
   read_string(Input, "\n", "\r\t ", _, Line),
   lex_line(Line),
   Count is Count0 + 1,
   read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :- % space bar
   nl, nl,
   read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key,_Input) :-
   Key \= 32.

lex_line_for_user_review(Line) :-
   lex_line(Line,TokList),
   print(Line),
   nl,
   print(TokList),
   nl.

lex_line(Line,TokList) :-
   string_chars(Line,Code_line),
   phrase(lexer(TokList),Code_line).

lex_line(Line) :-
   string_chars(Line,Code_line),
   phrase(lexer(TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
   print('Enter a line to parse or just Enter to exit: '),
   nl,
   read_string(user, "\n", "\r", _, String),
   nl,
   lex_line_for_user_review(String),
   nl,
   continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
   string_length(String,N),
   N > 0,
   read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
   string_length(String,0).

read_user_input_for_lexing_and_user_review/0 lets the user enter a string at the terminal, lexes it, and shows the tokens for review.
read_file_for_lexing_and_user_review/1 reads a file for lexing and reviews the tokens of each line, one line at a time.
read_file_for_lexing_and_performance/2 reads a file for lexing with a limit on the number of lines to lex. It is used to gather basic performance statistics to measure efficiency and is meant to be used with time/1.
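For example, such a timing run might be invoked like this (the file path and line limit are made up for illustration):

?- time(read_file_for_lexing_and_performance('/path/to/large_file.txt', 10000)).
% time/1 then reports the number of inferences, CPU time and Lips for the call.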

Best answer

One thing this means is that this is silly code:

token(T) -->
    (  "1", !, { T = one }
    ;  "2", !, { T = two }
    ;  "3", !, { T = three }
    ).

This is less silly code:
token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But it is still not very good. Maybe better:
token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

The last example also starts to look silly, but remember that you now get indexing on the first argument. You read once, there is no choice point, and no backtracking.
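As an illustration only (this is not part of the answer's code), the same first-argument indexing idea could be applied to the single-character punctuation tokens of the lexer in the question: read one character and look it up in a fact table.

% Sketch, assuming the same double_quotes = chars setting as the question's code;
% each fact maps one character to its token, so the first argument is indexed.
punct_token(':', tokColon).
punct_token('(', tokLParen).
punct_token(')', tokRParen).
punct_token('+', tokPlus).
punct_token('-', tokMinus).
% ... one fact per remaining single-character token ...

punct(T) --> [C], { punct_token(C, T) }.

Multi-character tokens such as "->" would still need their own rule, tried before this one.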

But if you really want to know how to write it efficiently, you need to measure where the time and space goes. Have you measured?
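For instance, SWI-Prolog's built-in profiler can be pointed at a lexing run (a hypothetical invocation, with a made-up path and limit):

?- profile(read_file_for_lexing_and_performance('/path/to/large_file.txt', 100000)).
% profile/1 runs the goal and shows where the time is spent, per predicate.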

But if you really want to know how to fix it, you could read "The Craft of Prolog". I do not understand all of that book, but I remember it has a large section on DCGs.

But if you really need to parse large files in this format, maybe look for an existing library in another language; it could be much faster than the fastest Prolog you can write.

This question, "performance - How to improve lexing efficiency?", corresponds to a similar question on Stack Overflow: https://stackoverflow.com/questions/54259696/
