
python - NLTK words corpus does not contain "okay"?

Reposted. Author: 太空狗. Updated: 2023-10-29 20:19:42

Why does the NLTK words corpus not contain the words "okay" or "Okay"?

>>> from nltk.corpus import words
>>> words.words().__contains__("check")
True

>>> words.words().__contains__("okay")
False

>>> len(words.words())
236736

Any ideas?

Best answer

TL;DR

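from nltk.corpus import words
from nltk.corpus import wordnet

# wordnet.words() may return an iterator rather than a list,
# so convert it to a list before concatenating.
manywords = words.words() + list(wordnet.words())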

In long

From the docs, nltk.corpus.words is a list of words from http://en.wikipedia.org/wiki/Words_(Unix)

On Unix, you can do:

ls /usr/share/dict/

And read the README:

$ cd /usr/share/dict/
/usr/share/dict$ cat README
# @(#)README 8.1 (Berkeley) 6/5/93
# $FreeBSD$

WEB ---- (introduction provided by jaw@riacs) -------------------------

Welcome to web2 (Webster's Second International) all 234,936 words worth.
The 1934 copyright has lapsed, according to the supplier. The
supplemental 'web2a' list contains hyphenated terms as well as assorted
noun and adverbial phrases. The wordlist makes a dandy 'grep' victim.

-- James A. Woods {ihnp4,hplabs}!ames!jaw (or jaw@riacs)

Country names are stored in the file /usr/share/misc/iso3166.


FreeBSD Maintenance Notes ---------------------------------------------

Note that FreeBSD is not maintaining a historical document, we're
maintaining a list of current [American] English spellings.

A few words have been removed because their spellings have depreciated.
This list of words includes:
    corelation (and its derivatives)    "correlation" is the preferred spelling
    freen                               typographical error in original file
    freend                              archaic spelling no longer in use;
                                        masks common typo in modern text

--

A list of technical terms has been added in the file 'freebsd'. This
word list contains FreeBSD/Unix lexicon that is used by the system
documentation. It makes a great ispell(1) personal dictionary to
supplement the standard English language dictionary.

Since it is a fixed list of 234,936 words, there are bound to be words that are not in it.
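One quick way to see which words a fixed list lacks is to load it into a set and test candidates against it. The sketch below assumes the list is a plain text file with one word per line at /usr/share/dict/words. That file name is an assumption and may differ on your system, and `load_word_list` and `missing_words` are hypothetical helper names.

```python
from pathlib import Path


def load_word_list(path="/usr/share/dict/words"):
    """Read a one-word-per-line file into a set (empty set if the file is absent)."""
    p = Path(path)  # assumed location; adjust for your system
    if not p.exists():
        return set()
    with p.open(encoding="utf-8", errors="replace") as fh:
        return {line.strip() for line in fh if line.strip()}


def missing_words(candidates, vocabulary):
    """Return the candidates found in the vocabulary neither as given nor lowercased."""
    return [
        w for w in candidates
        if w not in vocabulary and w.lower() not in vocabulary
    ]


if __name__ == "__main__":
    vocab = load_word_list()
    print(missing_words(["check", "okay", "Okay"], vocab))
```

The result depends on which word list is installed, so no expected output is shown.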

If you need to extend your word list, you can add words from WordNet using nltk.corpus.wordnet.words().
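If the combined list is only used for membership checks, a set is likely a better fit than a list, because `in` on a list scans every element. A minimal sketch, assuming both corpora have been downloaded with nltk.download(); `build_vocabulary` and `load_nltk_vocabulary` are hypothetical helper names, and lowercasing every entry is a design choice rather than anything NLTK does for you.

```python
from itertools import chain


def build_vocabulary(*word_sources):
    """Merge any number of word iterables (lists or iterators) into one lowercase set."""
    return {w.lower() for w in chain.from_iterable(word_sources)}


def load_nltk_vocabulary():
    """Combine the NLTK words corpus with WordNet lemma names."""
    # Imported lazily so build_vocabulary() stays usable without NLTK installed.
    from nltk.corpus import words, wordnet
    return build_vocabulary(words.words(), wordnet.words())


if __name__ == "__main__":
    vocab = load_nltk_vocabulary()
    print("okay" in vocab)
```

Note that WordNet lemma names join multi-word entries with underscores, so you may want to filter or split those depending on your use case.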

Most probably, all you need is a large enough text corpus, e.g. a Wikipedia dump. Tokenize it and extract all the unique words.
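The tokenize-and-collect step could look roughly like the sketch below. It uses a simple regular expression as a stand-in for a real tokenizer, which is an assumption made to keep the example dependency-free. `unique_words` and its `min_count` parameter are hypothetical names, and the input is any iterable of text lines, such as an open file.

```python
import re
from collections import Counter

# Rough English tokenizer: runs of letters, optionally followed by an
# apostrophe suffix (e.g. "that's"). A stand-in for a proper tokenizer.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def unique_words(lines, min_count=1):
    """Collect lowercase words that occur at least min_count times."""
    counts = Counter()
    for line in lines:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(line))
    return {word for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    sample = ["Okay, that's okay.", "Check it."]
    print(sorted(unique_words(sample)))
```

Raising `min_count` is one way to drop rare typos when the corpus is large and noisy.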

Regarding python - NLTK words corpus does not contain "okay"?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/44449284/
