nlp - Independent clause boundary disambiguation and independent clause segmentation: are there any tools that can do this?

Reposted by: 行者123 Updated: 2023-12-03 20:23:07

I remember browsing the sentence segmentation section of the NLTK site a long time ago.

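I segment sentences with a crude text replacement that turns "period" "space" into "period" "manual line break", for example with Microsoft Word's Replace ( . -> .^p ) or with a Chrome extension: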

https://github.com/AhmadHassanAwan/Sentence-Segmentation

https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha

This is not an NLP approach like NLTK's Punkt tokenizer.
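For reference, the crude replacement described above can be sketched in a few lines of Python. This is only an illustration of the period-space substitution, not of Punkt; the function name `crude_sentence_split` and the sample string are made up for the example.

```python
def crude_sentence_split(text: str) -> str:
    """Put a line break after every period that is followed by a space.

    Mirrors the Word replacement ". " -> ".^p"; it has no notion of
    abbreviations, decimals, or any other ambiguous use of the period.
    """
    return text.replace(". ", ".\n")


# Hypothetical sample text, chosen to show where the crude rule goes wrong.
sample = "Dr. Smith arrived at 3 p.m. on Monday. He left early."
print(crude_sentence_split(sample))
```

Because every ". " is treated as a boundary, the abbreviations "Dr." and "p.m." are split as well, which is exactly the ambiguity that sentence boundary disambiguation tries to resolve.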

I segment text so that I can locate and reread sentences more easily, which sometimes helps reading comprehension.

What about independent clause boundary disambiguation and independent clause segmentation? Are there any tools that attempt this?

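Below is some sample text. Wherever an independent clause can be identified within a sentence, it is split off. Start at the end of the sentence and move left, splitting greedily: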

For example:

Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where

sentences begin and end.

Often, natural language processing tools

require their input to be divided into sentences for a number of reasons.

However, sentence boundary identification is challenging because punctuation

marks are often ambiguous.

For example, a period may

denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence.

About 47% of the periods in the Wall Street Journal corpus

denote abbreviations.[1]

As well, question marks and exclamation marks may

appear in embedded quotations, emoticons, computer code, and slang.

Another approach is to automatically

learn a set of rules from a set of documents where the sentence

breaks are pre-marked.

Languages like Japanese and Chinese

have unambiguous sentence-ending markers.

The standard 'vanilla' approach to

locate the end of a sentence:

(a) If

it's a period,

it ends a sentence.

(b) If the preceding

token is on my hand-compiled list of abbreviations, then

it doesn't end a sentence.

(c) If the next

token is capitalized, then

it ends a sentence.

This

strategy gets about 95% of sentences correct.[2]

Solutions have been based on a maximum entropy model.[3]

The SATZ architecture uses a neural network to

disambiguate sentence boundaries and achieves 98.5% accuracy.

(I am not sure whether I split it correctly.)

If independent clauses cannot be segmented, are there any search terms I could use to explore the topic further?

Thanks.

Best answer

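As far as I know, there is no off-the-shelf tool that solves this exact problem. Typically, NLP systems do not get into the problem of identifying the different types of sentences and clauses defined by English grammar. There is a paper published at EMNLP that gives an algorithm which uses the SBAR tag in parse trees to identify the independent and dependent clauses in a sentence.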

You should find section 3 of this paper useful. It discusses English grammar in some detail, but I don't think the entire paper is relevant to your question.

Note that they used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (e.g. the Stanford parser, demo available here).
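To make the SBAR idea concrete, here is a minimal sketch in plain Python. It is not the algorithm from the paper: it only shows how a subordinate clause can be pulled out of a bracketed constituency parse by looking for SBAR nodes. The parse string `PARSE` is hand-written for illustration (a real one would come from a constituency parser), and the helper names `parse_bracketed`, `subordinate_clauses`, and `main_clause_words` are hypothetical.

```python
from typing import List, Union

# A tree is either a word (str) or a list: [label, child, child, ...]
Tree = Union[str, list]


def parse_bracketed(s: str) -> Tree:
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok              # a plain word
        node = [tokens[pos]]        # the constituent label, e.g. "SBAR"
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1                    # consume the closing ")"
        return node

    return read()


def leaves(tree: Tree) -> List[str]:
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1:] for w in leaves(child)]


def subordinate_clauses(tree: Tree) -> List[str]:
    """Return the text of each outermost SBAR subtree."""
    if isinstance(tree, str):
        return []
    if tree[0] == "SBAR":
        return [" ".join(leaves(tree))]
    return [c for child in tree[1:] for c in subordinate_clauses(child)]


def main_clause_words(tree: Tree) -> List[str]:
    """Return the words that remain once SBAR subtrees are dropped."""
    if isinstance(tree, str):
        return [tree]
    if tree[0] == "SBAR":
        return []
    return [w for child in tree[1:] for w in main_clause_words(child)]


# Hand-written parse (an assumption, not real parser output).
PARSE = (
    "(S (ADVP (RB However)) (, ,) "
    "(NP (NN sentence) (NN boundary) (NN identification)) "
    "(VP (VBZ is) (ADJP (JJ challenging)) "
    "(SBAR (IN because) (S (NP (NN punctuation) (NNS marks)) "
    "(VP (VBP are) (ADVP (RB often)) (ADJP (JJ ambiguous)))))) "
    "(. .))"
)

tree = parse_bracketed(PARSE)
print(subordinate_clauses(tree))
print(" ".join(main_clause_words(tree)))
```

For this sentence the SBAR node covers "because punctuation marks are often ambiguous", and what remains is the independent clause. Coordinated independent clauses (two S nodes joined by a conjunction) would need separate handling that this sketch does not attempt.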

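Regarding "nlp - Independent clause boundary disambiguation and independent clause segmentation: are there any tools that can do this?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23859892/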
