
file - Extracting text from a large file in Prolog

Reposted. Author: 行者123. Updated: 2023-12-05 01:16:26

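I want to use SWI-Prolog to extract the text between a start string and an end string, for example all titles in a Wikipedia dump. I don't want to use an XML parser, because I want to handle different file types the same way. I got it working for small files, but I run into problems with large ones.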

For large files (e.g. the Romanian Wikipedia), Prolog runs out of memory (invoked as prolog -G1G -L1G -T1G -s main.pl -t main; see the contents of main.pl below):

Welcome to SWI-Prolog (threaded, 64 bits, version 7.4.2)
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software.
Please run ?- license. for legal details.

For online help and background, visit http://www.swi-prolog.org
For built-in help, use ?- help(Topic). or ?- apropos(Word).

found: 'Rocarta'
found: 'Muzică'
found: 'Iris (formație românească)'
found: 'Pagina principală'
...[removed hundreds of lines]
found: 'Zadar'
found: 'Australia'
found: 'Slovenia'
found: 'Croația'
ERROR: Out of global stack
Exception: (5,861) between([60, 116, 105, 116, 108, 101, 62], [60, 47, 116, 105, 116, 108, 101, 62], _264890370, [10, 32, 32, 32, 32, 60, 110, 115|...], []) ?

How can I accomplish this task with large input files?

MWE (main.pl):

:- use_module(library(pio)).
:- use_module(library(dcg/basics)).
last_call_optimisation(true).

main :-
    phrase_from_file(between(`<title>`, `</title>`, _), `wiki.xml`).

between(Start, End, Found) -->
    string(_), string(Start), string(Found), string(End),
    { format("found: '~s' \n", [Found]) },
    between(Start, End, _).
between(_, _, []) -->
    remainder(_),
    { format("finished parsing") }.

Sample input (wiki.xml):

<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>

Sample output (expected):

found: 'Albert Einstein' 
found: 'Elvis Presley'
finished parsing

Edit: If we remove the recursive call from between/3, the output changes and no longer matches what I expect:

 found: 'Albert Einstein' 
found: 'Albert Einstein</title></page>
<page><title>Elvis Presley'
found: 'Elvis Presley'
finished parsing

Best answer

This construct

..., string(_), string(Start),  ...

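is highly inefficient. It turns linear parsing into exponential parsing. But there is a very simple solution, because string literals in a DCG perform an exact match: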

:- use_module(library(dcg/basics)).

main(Titles) :-
    %phrase_from_file(between(`<title>`, `</title>`, Titles),`wiki.xml`).
    phrase(between(`<title>`, `</title>`, Titles), `
<mediawiki>
>< Don't use an XML parser! ><
<page><title>Albert Einstein</title></page>
<page><title>Elvis Presley</title></page>
</mediawiki>
`).

between(_Start, _End, []) --> [].
between(Start, End, [Found|Rest]) -->
    Start, string(String), End,
    { atom_codes(Found, String) },
    !, between(Start, End, Rest).
between(Start, End, List) --> [_], between(Start, End, List).
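
For a very large dump, collecting every title into one list may itself be more than is wanted. The following is a minimal sketch, not part of the original answer, that prints each match as it is found, in the style of the question's expected output. The names print_between//2 and print_titles/1 are hypothetical. It assumes that the lazy list created by phrase_from_file/2 can be reclaimed by the garbage collector once no choice points refer to it, which is why every clause commits with a cut.

```prolog
:- use_module(library(pio)).         % phrase_from_file/2
:- use_module(library(dcg/basics)).  % string//1

% print_between(+Start, +End)//  (hypothetical helper)
% Start and End are code lists. Prints every text found between them.
print_between(Start, End) -->
    Start, string(Found), End,       % exact match of Start, shortest Found
    !,                               % commit, so no choice point survives
    { format("found: '~s'~n", [Found]) },
    print_between(Start, End).
print_between(Start, End) -->
    [_],                             % no match here: skip one code
    !,
    print_between(Start, End).
print_between(_, _) -->
    [],                              % reached only at end of input
    { format("finished parsing~n") }.

% print_titles(+File)  (hypothetical entry point)
print_titles(File) :-
    phrase_from_file(print_between(`<title>`, `</title>`), File).
```

Calling print_titles('wiki.xml') on the sample input should print the two titles followed by "finished parsing". Whether memory stays bounded on a multi-gigabyte dump still needs to be checked on the real file.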

I would simplify the code, though:

...
phrase(tag(`title`, Titles), `
...

tag(_Tag, []) --> [].
tag(Tag, [Found|Rest]) -->
    "<", Tag, ">", string(String), "</", Tag, ">",
    { atom_codes(Found, String) },
    !, tag(Tag, Rest).
tag(Tag, List) --> [_], tag(Tag, List).

I'd bet this is slightly more efficient on large files. It is also easy to generalize:

... phrase(tags([`title`, `footnote`], Contents), ` ...

tags(_Tags, []) --> [].
tags(Tags, [Key-Found|Rest]) -->
    "<", {member(Tag, Tags)}, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found,Key], [String,Tag]) },
    !, tags(Tags, Rest).
tags(Tags, List) --> [_], tags(Tags, List).
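
To make the shape of the result concrete, here is a small self-contained run of tags//2 on an in-memory input. The input text and the predicate name demo/1 are made up for illustration. Each element of the result is a Key-Found pair, where Key is the tag name and Found is the enclosed text, both as atoms.

```prolog
:- use_module(library(dcg/basics)).  % string//1
:- use_module(library(lists)).       % member/2
:- use_module(library(apply)).       % maplist/3

tags(_Tags, []) --> [].
tags(Tags, [Key-Found|Rest]) -->
    "<", { member(Tag, Tags) }, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found, Key], [String, Tag]) },
    !, tags(Tags, Rest).
tags(Tags, List) --> [_], tags(Tags, List).

% demo(-Pairs): hypothetical driver over a made-up input.
demo(Pairs) :-
    phrase(tags([`title`, `footnote`], Pairs),
           `<title>Zadar</title> text <footnote>a city</footnote>`),
    !.
```

The query ?- demo(Pairs). should yield Pairs = [title-'Zadar', footnote-'a city'].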

But that is not efficient. Better (though this should be confirmed by profiling):

...
"<", string(Tag), ">", {memberchk(Tag, Tags)}, string(String), "</", Tag, ">",
...

Edit: at least on a small set of tags, "<", {member(Tag, Tags)}, Tag, ">" seems to require far fewer inferences than "<", string(Tag), ">", {memberchk(Tag, Tags)}.
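
One way to compare the two variants is to count inferences with statistics/2. The sketch below is an assumption about how such a measurement could be set up, not the procedure used in the answer. The predicate names tags_member//2, tags_memberchk//2 and count_inferences/2 are hypothetical. The measured numbers include a small constant overhead from the measurement itself and will vary with the input, so none are quoted here.

```prolog
:- use_module(library(dcg/basics)).  % string//1
:- use_module(library(lists)).       % member/2, memberchk/2
:- use_module(library(apply)).       % maplist/3

% Variant A: choose a candidate tag first, then match it exactly.
tags_member(_Tags, []) --> [].
tags_member(Tags, [Key-Found|Rest]) -->
    "<", { member(Tag, Tags) }, Tag, ">", string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found, Key], [String, Tag]) },
    !, tags_member(Tags, Rest).
tags_member(Tags, List) --> [_], tags_member(Tags, List).

% Variant B: read whatever lies between "<" and ">", then check it.
% On a failed check, string(Tag) backtracks and extends to the next ">".
tags_memberchk(_Tags, []) --> [].
tags_memberchk(Tags, [Key-Found|Rest]) -->
    "<", string(Tag), ">", { memberchk(Tag, Tags) },
    string(String), "</", Tag, ">",
    { maplist(atom_codes, [Found, Key], [String, Tag]) },
    !, tags_memberchk(Tags, Rest).
tags_memberchk(Tags, List) --> [_], tags_memberchk(Tags, List).

% count_inferences(:Goal, -Count): inferences used by the first solution.
count_inferences(Goal, Count) :-
    statistics(inferences, I0),
    once(Goal),
    statistics(inferences, I1),
    Count is I1 - I0.
```

For example, ?- count_inferences(phrase(tags_member([`title`], _), Codes), N). and the same query with tags_memberchk can be run on the same code list Codes to compare the two counts.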

Regarding file - Extracting text from a large file in Prolog, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/46077052/
