
Simple sentence parser in Java

Reposted. Author: 塔克拉玛干. Updated: 2023-11-01 21:36:46

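Is there a simple way to create a sentence parser in plain Java, without adding any libraries or jars?

The parser should not only care about the spaces between words. It should be smarter and handle punctuation such as . ! , and recognize where a sentence ends, and so on.

After parsing, only the real words should be stored in a database or a file, without any special characters.

Many thanks to everyone :)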

Best answer

You may want to start by looking at the BreakIterator class.

From the JavaDoc:

The BreakIterator class implements methods for finding the location of boundaries in text. Instances of BreakIterator maintain a current position and scan over text returning the index of characters where boundaries occur. Internally, BreakIterator scans text using a CharacterIterator, and is thus able to scan text held by any object implementing that protocol. A StringCharacterIterator is used to scan String objects passed to setText.

You use the factory methods provided by this class to create instances of various types of break iterators. In particular, use getWordInstance, getLineInstance, getSentenceInstance, and getCharacterInstance to create BreakIterators that perform word, line, sentence, and character boundary analysis respectively. A single BreakIterator can work only on one unit (word, line, sentence, and so on). You must use a different iterator for each unit boundary analysis you wish to perform.

Line boundary analysis determines where a text string can be broken when line-wrapping. The mechanism correctly handles punctuation and hyphenated words.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word-breaks on both sides.

Character boundary analysis allows users to interact with characters as they expect to, for example, when moving the cursor through a text string. Character boundary analysis provides correct navigation through character strings, regardless of how the character is stored. For example, an accented character might be stored as a base character and a diacritical mark. What users consider to be a character can differ between languages.

BreakIterator is intended for use with natural languages only. Do not use this class to tokenize a programming language.
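The sentence boundary analysis described above can be sketched roughly as follows. This is only a minimal illustration, assuming English text and the JDK's default rules; the class name SentenceSplitter and the sentences helper are made up for this example and are not part of the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named here only for illustration.
final class SentenceSplitter {

    // Splits text into sentences using the JDK's sentence BreakIterator.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or DONE when the text is exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints one sentence per line.
        for (String s : sentences("Hello there! How are you? I am fine.", Locale.ENGLISH)) {
            System.out.println(s);
        }
    }
}
```

The boundaries include trailing whitespace, which is why each piece is trimmed before it is stored.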

See the demo: BreakIteratorDemo.java
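For the second part of the question (storing only real words), a word iterator can be combined with a simple filter. The following is a sketch rather than a definitive implementation: WordExtractor and words are hypothetical names, and "real word" is assumed here to mean any token that contains at least one letter or digit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, named here only for illustration.
final class WordExtractor {

    // Returns the word tokens of the text, skipping whitespace and punctuation.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String token = text.substring(start, end);
            // Assumption: a "real word" has at least one letter or digit.
            // Spaces and punctuation come back as their own tokens and are dropped.
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world! It works.", Locale.ENGLISH));
    }
}
```

The resulting list can then be written to a file or a database table with whatever storage code the application already uses.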

Regarding a simple sentence parser in Java, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/2103598/
