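I remember browsing the sentence segmentation section of the NLTK site a long time ago.
I segment sentences with a crude text replacement of "period" "space" with "period" "manual line break", for example with Microsoft Word's Replace ( ". " -> ".^p" ) or a Chrome extension: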
https://github.com/AhmadHassanAwan/Sentence-Segmentation
https://chrome.google.com/webstore/detail/sentence-segmenter/jfbhkblbhhigbgdnijncccdndhbflcha
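This is not an NLP approach like NLTK's Punkt tokenizer.
I segment text so that I can locate and reread sentences more easily, which sometimes helps with reading comprehension.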
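For comparison, the crude replacement described above can be reproduced in a few lines of Python. This is only a sketch of the "period + space" to "period + line break" substitution; the function name `crude_split` and the sample string are made up for illustration, and the second sentence shows where the approach breaks on abbreviations.

```python
import re


def crude_split(text: str) -> str:
    """Replace every 'period + whitespace' with 'period + newline'.

    No NLP is involved: abbreviations such as "Dr." are split as well.
    """
    return re.sub(r"\.\s+", ".\n", text)


# Hypothetical sample text, chosen to show the abbreviation problem.
sample = "Dr. Smith arrived. He sat down."
print(crude_split(sample))
```

A trained tokenizer such as Punkt exists precisely to avoid the false break after "Dr." that this substitution produces.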
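What about independent clause boundary disambiguation and independent clause segmentation? Are there any tools that attempt this?
Below is some sample text. Wherever an independent clause can be identified within a sentence, the sentence is split at that point. Start from the end of the sentence and move left, splitting greedily:
For example: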
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where
sentences begin and end.
Often, natural language processing tools
require their input to be divided into sentences for a number of reasons.
However, sentence boundary identification is challenging because punctuation
marks are often ambiguous.
For example, a period may
denote an abbreviation, decimal point, an ellipsis, or an email address - not the end of a sentence.
About 47% of the periods in the Wall Street Journal corpus
denote abbreviations.[1]
As well, question marks and exclamation marks may
appear in embedded quotations, emoticons, computer code, and slang.
Another approach is to automatically
learn a set of rules from a set of documents where the sentence
breaks are pre-marked.
Languages like Japanese and Chinese
have unambiguous sentence-ending markers.
The standard 'vanilla' approach to
locate the end of a sentence:
(a) If
it's a period,
it ends a sentence.
(b) If the preceding
token is on my hand-compiled list of abbreviations, then
it doesn't end a sentence.
(c) If the next
token is capitalized, then
it ends a sentence.
This
strategy gets about 95% of sentences correct.[2]
Solutions have been based on a maximum entropy model.[3]
The SATZ architecture uses a neural network to
disambiguate sentence boundaries and achieves 98.5% accuracy.
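
As an aside, rules (a) to (c) of the "vanilla" approach quoted in the sample text can be sketched roughly as follows. This is one possible reading in which each later rule overrides the earlier one; the abbreviation list and the function name `vanilla_split` are assumptions made for illustration, not part of any library.

```python
# Hypothetical hand-compiled abbreviation list (lowercase, with trailing period).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "p.m.", "a.m.", "etc."}


def vanilla_split(text: str) -> list[str]:
    """Split text into sentences using rules (a)-(c) on whitespace tokens."""
    tokens = text.split()
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        if not tok.endswith("."):
            continue
        ends = True                            # (a) a period ends a sentence
        if tok.lower() in ABBREVIATIONS:
            ends = False                       # (b) unless the token is an abbreviation
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and nxt[0].isupper():
            ends = True                        # (c) a capitalized next token ends it anyway
        if ends:
            sentences.append(" ".join(current))
            current = []
    if current:                                # trailing text without a final period
        sentences.append(" ".join(current))
    return sentences


print(vanilla_split("We met at 5 p.m. yesterday. It rained."))
print(vanilla_split("Dr. Smith left."))
```

The second call splits after "Dr." because rule (c) fires on "Smith", which is the kind of error that keeps such heuristics from being fully accurate.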
Best answer
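As far as I know, there is no off-the-shelf tool that solves this exact problem. NLP systems generally do not tackle the problem of identifying the different types of sentences and clauses defined by English grammar. There is a paper published at EMNLP that provides an algorithm which uses the SBAR tag in the parse tree to identify the independent and dependent clauses in a sentence.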
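You should find section 3 of this paper useful. It discusses English grammar in some detail, but I don't think the whole paper is relevant to your question.
Note that they used the Berkeley parser (demo available here), but you can obviously use any other constituency parsing tool (for example the Stanford parser, demo available here).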
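
To illustrate the idea of using SBAR nodes, here is a minimal sketch that reads a Penn Treebank style bracketed parse and extracts the words under each SBAR node as a subordinate clause. It does not reproduce the paper's algorithm, and the parse string is hand-written for this example rather than the output of the Berkeley or Stanford parser; the helper names (`parse_bracketed`, `leaves`, `find_label`) are hypothetical.

```python
def parse_bracketed(s: str):
    """Parse '(LABEL child ...)' into nested (label, children) tuples."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())        # nested constituent
            else:
                children.append(tokens[pos])   # terminal word
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return read()


def leaves(tree) -> list[str]:
    """Return the words covered by a subtree, left to right."""
    words = []
    for child in tree[1]:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def find_label(tree, label: str) -> list:
    """Collect every subtree whose node label equals `label`."""
    found = [tree] if tree[0] == label else []
    for child in tree[1]:
        if not isinstance(child, str):
            found.extend(find_label(child, label))
    return found


# Hand-written parse of "I left because it rained ." (an assumption, not parser output).
parse = ("(S (NP (PRP I)) (VP (VBD left) "
         "(SBAR (IN because) (S (NP (PRP it)) (VP (VBD rained))))) (. .))")
tree = parse_bracketed(parse)
subordinate = [" ".join(leaves(t)) for t in find_label(tree, "SBAR")]
print(subordinate)
```

Each SBAR span marks a dependent clause; whatever remains of the sentence outside those spans is a candidate for the independent clause, which is where a clause-level split could be placed.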
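Regarding "nlp - Independent clause boundary disambiguation and independent clause segmentation: are there any tools that do this?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23859892/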