
python - Is there an easy way to generate a probable list of words from an unspaced sentence in Python?

Reposted. Author: 太空狗. Updated: 2023-10-29 18:29:49
I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

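I'd like to parse this into its individual words. I had a quick look at enchant and nltk but didn't see anything that looked immediately useful. If I had time to invest in this, I'd look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought something that does this already exists online; am I wrong?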

Best answer

Greedy approach using a trie

Try this using Biopython (pip install biopython):

# Written for Python 2 and an older Biopython release that still ships Bio.trie.
from Bio import trie
import string


def get_trie(dictfile='/usr/share/dict/american-english'):
    # Build a trie holding every word from the system dictionary file.
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Drop non-ASCII characters before using the word as a key.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr


def get_trie_word(tr, s):
    # Return the longest prefix of s found in the trie, plus the remainder.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s


def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            break
        print(word)


if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Result

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
...
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches
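
For readers without Biopython, the same greedy longest-prefix idea can be sketched with a plain Python set standing in for the trie. This is only an illustration, not the answer's code: the names `longest_prefix_word`, `greedy_segment`, and `VOCAB` are made up here, and the toy vocabulary replaces a real dictionary file.

```python
def longest_prefix_word(vocabulary, s):
    """Return (word, rest) for the longest prefix of s in vocabulary, or (None, s)."""
    for end in range(len(s), 0, -1):
        if s[:end] in vocabulary:
            return s[:end], s[end:]
    return None, s


def greedy_segment(vocabulary, s):
    """Split s greedily; return the words found and any unmatched leftover."""
    words = []
    while s:
        word, s = longest_prefix_word(vocabulary, s)
        if word is None:
            # No vocabulary word starts here, so give up on the remainder.
            break
        words.append(word)
    return words, s


# Hypothetical toy vocabulary; a real run would load a dictionary file instead.
VOCAB = {"image", "classification", "methods", "can", "be", "a", "an", "me"}

words, leftover = greedy_segment(VOCAB, "imageclassificationmethodscanbe")
print(words, repr(leftover))
```

A set lookup checks each candidate prefix separately, so it does more work than a trie on long inputs, but the segmentation logic is the same.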

Caveats

There are degenerate cases in English where this will not work. You would need backtracking to handle those, but this should get you started.
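
One possible shape for that backtracking, as a rough sketch rather than part of the original answer: try the longest matching word first, and fall back to shorter ones when the remainder cannot be segmented. The function name `segment` and the three-word vocabulary are hypothetical, and a plain set is assumed in place of the trie.

```python
from functools import lru_cache


def segment(vocabulary, s):
    """Return one list of vocabulary words that concatenates to s, or None."""
    vocabulary = frozenset(vocabulary)

    @lru_cache(maxsize=None)
    def solve(start):
        # An empty remainder is trivially segmented.
        if start == len(s):
            return ()
        # Prefer longer words, but keep trying shorter ones if the rest fails.
        for end in range(len(s), start, -1):
            if s[start:end] in vocabulary:
                rest = solve(end)
                if rest is not None:
                    return (s[start:end],) + rest
        return None

    result = solve(0)
    return None if result is None else list(result)


# Hypothetical vocabulary where greedy matching fails:
# taking "there" first leaves "in", which is not a known word.
VOCAB = {"the", "there", "rein"}
print(segment(VOCAB, "therein"))  # ['the', 'rein']
```

The cache on `solve` means each start position is solved only once, so a dead end is not re-explored. The recursion depth grows with the number of words, which may matter for very long inputs.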

Obligatory test

>>> main("expertsexchange")
experts
exchange
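
This input is presumably a nod to the fact that the same string can be split in more than one way. A small sketch that enumerates every split makes the ambiguity visible; the generator `all_segmentations` and the five-word vocabulary below are illustrative assumptions, not part of the answer.

```python
def all_segmentations(vocabulary, s):
    """Yield every way to split s into vocabulary words, as lists."""
    if not s:
        yield []
        return
    for end in range(1, len(s) + 1):
        if s[:end] in vocabulary:
            # Combine this prefix with every segmentation of the remainder.
            for rest in all_segmentations(vocabulary, s[end:]):
                yield [s[:end]] + rest


# Hypothetical vocabulary chosen to expose the ambiguity.
VOCAB = {"expert", "experts", "exchange", "sex", "change"}

for words in all_segmentations(VOCAB, "expertsexchange"):
    print(" ".join(words))
```

The greedy version only ever reports the split that starts with the longest word; choosing among several valid splits would need extra information such as word frequencies.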

Regarding "python - Is there an easy way to generate a probable list of words from an unspaced sentence in Python?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/15364975/
