
java - Extracting data with a regular expression in Java

Reposted. Author: 行者123. Updated: 2023-12-01 17:05:55

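I wrote a Java program that uses a regular expression to extract some information.

The goal of the code is to extract the field values (the text after = and before the | symbol) found inside each {{cite book .....}} block of a text file.

My code: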

    final String regex = "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]+)";

    final Pattern pattern1 = Pattern.compile(regex);
    final Matcher matcher1 = pattern1.matcher(wikifile);
    System.out.println("+++++++++++++++++++++++++++++++++++++++++++++++++");
    System.out.println("\n BOOK: \n ");

    while (matcher1.find()) {
        if (matcher1.group(1).trim().equals("title")) System.out.println("\n----------------------\n");

        if (matcher1.group(1).trim().equals("title")
                || matcher1.group(1).trim().equals("first")
                || matcher1.group(1).trim().equals("last")
                || matcher1.group(1).trim().equals("auther")
                || matcher1.group(1).trim().equals("url")
                || matcher1.group(1).trim().equals("publisher")
                || matcher1.group(1).trim().equals("isbn")) {

            System.out.println(matcher1.group(1) + " = " + matcher1.group(2));
        }
    }

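It works well when the fields are spread over several lines, but when everything is on one long line it does not extract all the information I want, and I don't know why.

For example: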

{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380|location=|pages =40}}

I want to extract url, title, last, first, publisher and isbn.

But the output is:

 BOOK: 

url = https://books.google.es/books?id=husqgrry7f4c

----------------------

title = ajax black book, new edition (with cd)
last = kogent solutions inc

When the input looks like this:

 {{Cite book
|url=https://books.google.es/books?id=HuSQGrRY7F4C
|title=Ajax Black Book, New Edition (With Cd)
|last=Kogent Solutions Inc
|first =
|publisher = Dreamtech Press
|year=2008
|isbn=978-817722838
|location=
|pages =40}} </ref>

the output looks like this:

 BOOK: 

url = https://books.google.es/books?id=husqgrry7f4c


----------------------

title = ajax black book, new edition (with cd)

last = kogent solutions inc

first =

publisher = dreamtech press

isbn = 978-817722838

last = flanagan

first = david

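Update: I think the problem is in the pattern (the regex). It seems to fail when a field is empty, that is, when there is nothing, not even a space, between = and | (as in first=|location=|), and everything is on a single line. I don't know why.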
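That guess can be checked in isolation. The sketch below is only an illustration: the class `EmptyValueDemo` and the helper `keys` are hypothetical names, and the sample strings are shortened, lowercased versions of the input above. It runs the pattern from the question next to a variant whose value group uses `*` instead of `+`. With `+` the value needs at least one character, so an empty field on a single line stops the `\G` chain; on multi-line input the line break itself counts as the value, which would explain why that case works.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyValueDemo {
    // Pattern from the question: the value group ([^|}]+) requires at least one character.
    static final String PLUS =
        "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]+)";

    // Variant (an assumption, not part of the question): '*' also accepts an empty value.
    static final String STAR =
        "(?:\\{\\{cite book\\b[^|]*|\\G(?!^))(?=[^}]*}})\\|([^=]+)=([^|}]*)";

    // Hypothetical helper: returns the trimmed field names found by the given regex.
    static List<String> keys(String regex, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String oneLine =
            "{{cite book|last=kogent|first =|publisher = dreamtech press|isbn=978-8177228380}}";
        String multiLine =
            "{{cite book\n|last=kogent\n|first =\n|publisher = dreamtech press\n|isbn=978-8177228380}}";

        System.out.println(keys(PLUS, oneLine));   // chain stops at the empty "first" field
        System.out.println(keys(STAR, oneLine));   // empty value is accepted
        System.out.println(keys(PLUS, multiLine)); // the newline serves as the value of "first"
    }
}
```

Switching to `*` only addresses the empty-value case; a field without any `=` sign would still confuse this pattern, because the key group `[^=]+` can run across a `|`.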

2- Is there a way to extract the fields (url, title, publisher, etc.) with the regex pattern itself, instead of using .group(1).trim().equals("title")?

Thanks

Best answer

Last update

The regular expression looks for the {{Cite book prefix and then selects the key=value pairs separated by the pipe character '|':

(?i:(?<=^|\|)(\{\{Cite\s*book\s*)|([^{|}=]+)\s*=\s*([^{|}]*[ ]*))

The following code demonstrates this regular expression:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Capture group indexes used by the regex below
    static final int PREFIX_GROUP = 1;
    static final int FIELD_NAME_GROUP = 2;
    static final int FIELD_VALUE_GROUP = 3;

    // .....
    // Note: '=' needs no escaping; "\=" is not a valid escape in a Java string literal
    String regex = "(?i:(?<=^|\\|)(\\{\\{Cite\\s*book\\s*)|([^{|}=]+)\\s*=\\s*([^{|}]*[ ]*))";
    Pattern pattern = Pattern.compile(regex);

    String txt = "{{cite book\n | url=https://books.google.es/books?id=HuSQGrRY7F4C\n | \"title\"=Ajax Black Book, New Edition (With Cd)\n | 'last'=Kogent Solutions Inc | fir$$t =| publisher = Dreamtech Press\n|editor_1= \"William Gates III, Jr.\" |some.dashed-field=TestDot.NET|year=2008\n| isbn=978-8177228380\n|location=\n|key_w/o_value|pages =40|}}";

    Matcher match = pattern.matcher(txt);
    while (match.find()) {
        // A match is either the prefix (group 1) or a key=value pair (groups 2 and 3)
        if (match.group(PREFIX_GROUP) != null) {
            System.out.println("prefix: " + match.group(PREFIX_GROUP).trim());
        }
        if (match.group(FIELD_NAME_GROUP) != null) {
            String key = match.group(FIELD_NAME_GROUP).trim();
            String value = match.group(FIELD_VALUE_GROUP).trim();
            System.out.println(key + " = " + value);
        }
    }

It produces the following output:

prefix: {{cite book
url = https://books.google.es/books?id=HuSQGrRY7F4C
"title" = Ajax Black Book, New Edition (With Cd)
'last' = Kogent Solutions Inc
fir$$t =
publisher = Dreamtech Press
editor_1 = "William Gates III, Jr."
some.dashed-field = TestDot.NET
year = 2008
isbn = 978-8177228380
location =
pages = 40
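
On the second question, one option is to keep the pattern generic and filter the field names afterwards with a `Set`, rather than chaining `equals` calls. The following is a minimal sketch under stated assumptions: the class `CiteBookFields`, the method `extract` and the `WANTED` set are hypothetical names that do not appear in the original answer, and the pattern is the one shown above with `=` left unescaped. Field names are lowercased before the lookup so that `Title` and `title` are treated alike.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CiteBookFields {
    // Group 1 = "{{Cite book" prefix, group 2 = field name, group 3 = field value.
    static final Pattern PAIR = Pattern.compile(
        "(?i:(?<=^|\\|)(\\{\\{Cite\\s*book\\s*)|([^{|}=]+)\\s*=\\s*([^{|}]*[ ]*))");

    // Field names to keep (an assumption taken from the list in the question).
    static final Set<String> WANTED = new HashSet<>(
        Arrays.asList("url", "title", "last", "first", "publisher", "isbn"));

    // Returns the wanted fields in the order they appear in the text.
    static Map<String, String> extract(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(text);
        while (m.find()) {
            if (m.group(2) == null) {
                continue; // this match is the prefix, not a key=value pair
            }
            String key = m.group(2).trim().toLowerCase(Locale.ROOT);
            if (WANTED.contains(key)) {
                fields.put(key, m.group(3).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "{{Cite book|url=https://books.google.es/books?id=HuSQGrRY7F4C"
            + "|title=Ajax Black Book, New Edition (With Cd)|last=Kogent Solutions Inc"
            + "|first =|publisher = Dreamtech Press|year=2008|isbn=978-8177228380"
            + "|location=|pages =40}}";
        extract(line).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Two limits of this sketch are worth keeping in mind: the pattern matches any key=value text, not only text inside a {{Cite book ...}} block, and with several citations in one input a later value overwrites an earlier one under the same key.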

This post is based on a similar question found on Stack Overflow, "java - Extracting data with a regular expression in Java": https://stackoverflow.com/questions/61465745/
