
java - Extracting information from text

Reposted. Author: 行者123. Updated: 2023-12-02 02:16:22

I have the following text:

Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.              

Name Group 12345678
ALEX A ALEX
ID# PUBLIC NETWORK
XYZ123456789


Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book.

I want to extract the ID value that appears under the ID# keyword in the text.

The problem is that the ID can sit at a different position in different text files, for example in the middle of other text, like this:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's          
standard dummy text ever since the 1500s, when an unknown printer took a XYZ123456789 galley of type and scrambled it to make a type specimen book.

There can also be extra lines between ID# and the value:

Lorem Ipsum is simply dummy text of                                          ID#             the printing and typesetting industry. Lorem Ipsum has been the industry's      
printing and typesetting industry. Lorem Ipsum has been the printing and typesetting industry. Lorem Ipsum has been the
standard dummy text ever since the 1500s, when an unknown printer took a XYZ123456789 galley of type and scrambled it to make a type specimen book.

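Could you show a way to extract the ID# value in the cases above? Is there a standard technique that can be applied here, for example RegEx or some approach built on top of RegEx? Could NLP be applied here?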

Best answer

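The ID value does not seem to have a definite format, so a single regex will not help: there is hardly anything regular to match on.

You have to use two regexes to get the expected output. The first one is: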

(?m)^(.*)ID#.*([\s\S]*)

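It looks for a line that contains ID# and captures two chunks of the input. The first group is everything from the start of that line up to ID#; the second group is everything that comes after the ID# line.

Next we take the length of the first capturing group. This gives the column number at which to start looking for the ID on the following lines: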

m.group(1).length();
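
To make the column calculation concrete, here is a minimal sketch that runs the first regex and returns the length of group 1. The class name IdColumnSketch, the method idColumn, and the sample strings are illustrative assumptions, not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdColumnSketch {
    // First regex from the answer: group 1 is the text before ID# on its line,
    // group 2 is everything after the ID# line.
    private static final Pattern ID_LINE =
            Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    /** Returns the zero-based column where ID# starts, or -1 if no line contains ID#. */
    static int idColumn(String text) {
        Matcher m = ID_LINE.matcher(text);
        return m.find() ? m.group(1).length() : -1;
    }

    public static void main(String[] args) {
        // Hypothetical two-line input: ID# starts after "aaaa bbbb " (10 characters).
        String sample = "aaaa bbbb ID# cccc\ndddd eeee XYZ123456789 ffff\n";
        System.out.println(idColumn(sample)); // prints 10
    }
}
```

Because (.*) is greedy, a line that contains ID# more than once yields the column of the last occurrence on that line.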

We then build a second regex that uses this length:

(?m)^.{X}(?<!\S)\h{0,3}(\S+)

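Breakdown:

  • (?m) enables multiline mode
  • ^ matches the start of a line
  • .{X} matches the first X characters (X being m.group(1).length())
  • (?<!\S) asserts that the current position is not preceded by a non-whitespace character, i.e. it follows whitespace or is at the start of the line
  • \h{0,3} matches up to 3 optional horizontal whitespace characters (in case the value is shifted to the right)
  • (\S+) captures the non-whitespace characters that follow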
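
The pieces listed above can be tried in isolation. The sketch below builds the second regex for a given column and applies it to a few lines; the names ColumnValueSketch, valueAtColumn and firstValueAt, as well as the sample input, are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ColumnValueSketch {
    /** Builds the second regex, substituting the column for X. */
    static Pattern valueAtColumn(int column) {
        return Pattern.compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)");
    }

    /**
     * Returns the first token that starts at the column (or up to 3 blanks
     * to its right) on any line of the given text, or null if none does.
     */
    static String firstValueAt(int column, String linesAfterIdRow) {
        Matcher m = valueAtColumn(column).matcher(linesAfterIdRow);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // On the first line, column 10 falls inside a word, so (?<!\S) rejects it.
        // On the second line, column 10 follows a space, so the token there is captured.
        String rest = "abcdefghijklmnop\ndddd eeee XYZ123456789 ffff\n";
        System.out.println(firstValueAt(10, rest)); // prints XYZ123456789
    }
}
```

Lines shorter than the column are skipped as well, because . does not match a line terminator. One caveat of this approach: if an intermediate line happens to have a word starting at that column, that word is returned first.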

Then we run this regex against the second capturing group of the previous regex:

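import java.util.regex.Matcher;
import java.util.regex.Pattern;

// `string` holds the input text.
Matcher m = Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)").matcher(string);
if (m.find()) {
    // The length of group 1 is the column where ID# starts.
    Matcher m1 = Pattern.compile("(?m)^.{" + m.group(1).length() + "}(?<!\\S)\\h{0,3}(\\S+)").matcher(m.group(2));
    if (m1.find()) {
        System.out.println(m1.group(1));
    }
}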
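
For an end-to-end run, the same two steps can be wrapped in a small helper and applied to the first sample from the question, where ID# starts at column 0. This is only a sketch: the class name IdExtractorSketch, the method extractId, and the use of Optional are choices made here, not something the answer prescribes.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdExtractorSketch {
    private static final Pattern ID_LINE =
            Pattern.compile("(?m)^(.*)ID#.*([\\s\\S]*)");

    /** Finds the token positioned under the ID# marker, if there is one. */
    static Optional<String> extractId(String text) {
        Matcher idLine = ID_LINE.matcher(text);
        if (!idLine.find()) {
            return Optional.empty(); // no line contains ID#
        }
        int column = idLine.group(1).length();
        Matcher value = Pattern
                .compile("(?m)^.{" + column + "}(?<!\\S)\\h{0,3}(\\S+)")
                .matcher(idLine.group(2));
        return value.find() ? Optional.of(value.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // The table-like block from the question's first sample.
        String text = String.join("\n",
                "Name Group 12345678",
                "ALEX A ALEX",
                "ID# PUBLIC NETWORK",
                "XYZ123456789",
                "");
        System.out.println(extractId(text).orElse("(no ID found)")); // prints XYZ123456789
    }
}
```

Returning Optional makes the two failure cases (no ID# line, or no token under its column) explicit to the caller instead of silently printing nothing.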


Regarding java - extracting information from text, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/49256044/
