
java - Counting sentences with REGEX while ignoring acronyms

Reposted. Author: 搜寻专家. Updated: 2023-10-31 20:33:46

I am trying to count the number of sentences in a text using regular expressions. I came up with regex1, which finds all the points:

([^.!?\s][^.!?]*)

After that, I tried to find most acronyms with the following regex2:

([A-Z]+[a-z]{0,3}\.).

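However, I have a couple of problems:

  1. If an acronym is at the end of a sentence, regex2 still finds it (for example, "since 20,000 B.C."). That is not intended; I only want to find acronyms inside a sentence.

  2. Assuming problem 1 is solved, I would like to merge the two regex formulas so that the final formula outputs only the actual number of sentences. As an example, consider the following text from Wikipedia: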

The National Aeronautics and Space Administration (NASA) is the United States government agency responsible for the civilian space program as well as aeronautics and aerospace research.

President Dwight D. Eisenhower established the National Aeronautics and Space Administration (NASA) in 1958[5] with a distinctly civilian (rather than military) orientation encouraging peaceful applications in space science. The National Aeronautics and Space Act was passed on July 29, 1958, disestablishing NASA's predecessor, the National Advisory Committee for Aeronautics (NACA). The new agency became operational on October 1, 1958.[6][7]

Since that time, most U.S. space exploration efforts have been led by NASA, including the Apollo moon-landing missions, the Skylab space station, and later the Space Shuttle. Currently, NASA is supporting the International Space Station and is overseeing the development of the Orion Multi-Purpose Crew Vehicle, the Space Launch System and Commercial Crew vehicles. The agency is also responsible for the Launch Services Program (LSP) which provides oversight of launch operations and countdown management for unmanned NASA launches.

NASA science is focused on better understanding Earth through the Earth Observing System,[8] advancing heliophysics through the efforts of the Science Mission Directorate's Heliophysics Research Program,[9] exploring bodies throughout the Solar System with advanced robotic spacecraft missions such as New Horizons,[10] and researching astrophysics topics, such as the Big Bang, through the Great Observatories and associated programs.[11] NASA shares data with various national and international organizations such as from the Greenhouse Gases Observing Satellite.

The text above has 9 sentences.

Regex1: 12 matches (D., U. and S. are treated as "full stops")

Regex2: 3 matches (D., U. and S.)

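What I need now is a better regex2 formula that only finds acronyms inside a sentence, and then a way to "merge" the two regex formulas so that the result covers all sentences.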

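If the two formulas cannot be merged (for any reasonable reason), then please consider only problem 1, because at the moment my Java program uses two separate formulas: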

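// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public void breakIntoSentences()
{
    // Count every run of text between sentence terminators (., ! or ?)
    Pattern p = Pattern.compile("([^.!?\\s][^.!?]*)");
    Matcher m = p.matcher(content);

    int allPoints = 0;
    while (m.find())
        allPoints++;

    // Count acronym-like tokens: one or more uppercase letters,
    // then up to three lowercase letters, then a period
    p = Pattern.compile("([A-Z]+[a-z]{0,3}\\.)");
    m = p.matcher(content);

    int allAcronyms = 0;
    while (m.find())
        allAcronyms++;

    // Subtract the acronym matches from the candidate count
    numberOfSentences = allPoints - allAcronyms;
}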
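
Problem 1 can be reproduced with the two patterns from the method above. The following is only a minimal sketch: the class name SubtractionCount, its helper methods, and the two sample texts are illustrative assumptions, not part of the original program. When the acronym sits inside the sentence the subtraction gives the right answer, but when it ends the sentence (B.C.), both of its periods are subtracted and that sentence disappears from the total.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, used only to illustrate problem 1.
class SubtractionCount {
    // regex1 and regex2 exactly as used in breakIntoSentences()
    static final Pattern POINTS = Pattern.compile("([^.!?\\s][^.!?]*)");
    static final Pattern ACRONYMS = Pattern.compile("([A-Z]+[a-z]{0,3}\\.)");

    // Counts how many times the pattern matches in the text.
    static int count(Pattern p, String text) {
        Matcher m = p.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // Same arithmetic as the question: candidate runs minus acronym matches.
    static int estimateSentences(String text) {
        return count(POINTS, text) - count(ACRONYMS, text);
    }

    public static void main(String[] args) {
        // Acronym inside a sentence: 3 runs - 1 acronym = 2 (correct).
        String inside = "President Dwight D. Eisenhower signed the act. It passed in 1958.";
        // Acronym at the end of a sentence: 3 runs - 2 acronym parts = 1 (one sentence too few).
        String atEnd = "Humans have lived here since 20,000 B.C. The site was found in 1958.";
        System.out.println(estimateSentences(inside)); // prints 2
        System.out.println(estimateSentences(atEnd));  // prints 1
    }
}
```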

Thanks in advance for your help.

Best answer

Here is a pattern:

.+?(?:(?<![\s.]\p{Lu})[.!?]|$)


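  • .+? is there only to match a whole sentence. If you just want a count, you can replace it with a single . (dot).
  • (?<![\s.]\p{Lu}) means "not preceded by an uppercase letter that is itself preceded by whitespace or a period". It is placed before [.!?], which checks for the end of a sentence. This seems to handle acronyms correctly.
  • $ is there to force the non-greedy .+? at the start to match through to the end of the text, in case the text does not end with a period.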
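
As a rough illustration of how the pattern could be used from Java, here is a minimal sketch. The class name SentenceCounter and the method countSentences are assumptions made for this example; only the regular expression itself comes from the answer. Note that the backslashes have to be doubled inside a Java string literal.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the answer's pattern.
class SentenceCounter {
    // Lazily consume text up to a terminator that is NOT preceded by
    // "whitespace-or-period + one uppercase letter", or up to the end of input.
    static final Pattern SENTENCE =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?]|$)");

    // Each match of the pattern is counted as one sentence.
    static int countSentences(String text) {
        Matcher m = SENTENCE.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "President Dwight D. Eisenhower established NASA in 1958. "
                + "Since that time, most U.S. space exploration efforts have been led by NASA.";
        // The periods in "D.", "U." and "S." are skipped by the lookbehind.
        System.out.println(countSentences(text)); // prints 2
    }
}
```

Because . does not match line terminators by default, text that spans several paragraphs is matched paragraph by paragraph; whether that is the desired behavior depends on the input.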

This regex treats [6][7] as part of the next sentence. If that is not acceptable, you can tweak the pattern slightly by adding [\d\[\]]* right after [.!?].
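
The effect of that tweak can be compared side by side. This is a sketch under assumptions: the class CitationAwareSplitter, its constants, and the shortened sample sentence are made up for illustration. One consequence worth knowing is that [\d\[\]]* also absorbs any digits that directly follow a terminator, not just bracketed citation markers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the original pattern and the tweaked one.
class CitationAwareSplitter {
    // The answer's pattern as given.
    static final Pattern PLAIN =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?]|$)");
    // Same pattern with [\d\[\]]* added right after [.!?].
    static final Pattern WITH_CITATIONS =
            Pattern.compile(".+?(?:(?<![\\s.]\\p{Lu})[.!?][\\d\\[\\]]*|$)");

    // Returns every match of the pattern, trimmed of surrounding whitespace.
    static List<String> sentences(Pattern p, String text) {
        List<String> result = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            result.add(m.group().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The new agency became operational on October 1, 1958.[6][7] "
                + "Since that time, NASA has led most efforts.";
        // PLAIN: the second sentence starts with "[6][7]".
        System.out.println(sentences(PLAIN, text));
        // WITH_CITATIONS: "[6][7]" stays attached to the first sentence.
        System.out.println(sentences(WITH_CITATIONS, text));
    }
}
```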

Regarding "java - Counting sentences with REGEX while ignoring acronyms", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29673147/
