
python - Autocorrecting words in a Python list

Reposted. Author: 太空宇宙. Updated: 2023-11-03 13:55:44

I want to autocorrect the words that appear in my list.

Say I have a list:

kw = ['tiger','lion','elephant','black cat','dog']

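I want to check whether these words appear in my sentences. If they are misspelled, I want to correct them. I do not intend to touch any words other than those in the given list.

Now I have a list of strings: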
s = ["I saw a tyger","There are 2 lyons","I mispelled Kat","bulldogs"]

Expected output:

['tiger','lion',None,'dog']

My attempt:

import difflib

op = [difflib.get_close_matches(i,kw,cutoff=0.5) for i in s]
print(op)

My output:

[[], [], [], ['dog']]

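The problem with the code above is that I need to compare against entire sentences, and the entries in my kw list can have more than one word (up to 4-5 words).

If I lower the cutoff value, it starts returning words that it should not.

So even if I were to build bigrams and trigrams from each sentence, it would take a lot of time.

So is there a way to do this?

I have explored a few more libraries such as autocorrect and hunspell, but without success.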

Best answer

You can implement something based on Levenshtein distance.
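As a rough illustration of the distance mentioned above, here is a minimal pure-Python sketch of the classic dynamic-programming Levenshtein distance. The function name `levenshtein` is hypothetical and is not taken from any library; a compiled implementation would normally be preferred for speed.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic Levenshtein distance: insertions, deletions, substitutions."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("tyger", "tiger"))  # 1 (one substitution)
print(levenshtein("lyons", "lion"))   # 2 (one substitution, one deletion)
```

Note that this plain variant does not count a swap of two adjacent characters as a single edit; the Damerau variant does.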

It is worth looking at how elasticsearch implements this: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzziness.html

Clearly, bieber is a long way from beaver—they are too far apart to be considered a simple misspelling. Damerau observed that 80% of human misspellings have an edit distance of 1. In other words, 80% of misspellings could be corrected with a single edit to the original string.

Elasticsearch supports a maximum edit distance, specified with the fuzziness parameter, of 2.

Of course, the impact that a single edit has on a string depends on the length of the string. Two edits to the word hat can produce mad, so allowing two edits on a string of length 3 is overkill. The fuzziness parameter can be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters

1 for strings of three, four, or five characters

2 for strings of more than five characters
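The AUTO rule quoted above can be written as a small helper. This is only a sketch of the three thresholds in the quote; the name `auto_fuzziness` is hypothetical and is not an Elasticsearch API.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under the quoted AUTO rule."""
    n = len(term)
    if n <= 2:
        return 0  # one or two characters: exact match only
    if n <= 5:
        return 1  # three, four, or five characters
    return 2      # more than five characters


for term in ["a", "hat", "tiger", "elephant"]:
    print(term, auto_fuzziness(term))
```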

I like to use pyxDamerauLevenshtein myself.

pip install pyxDamerauLevenshtein

So you can do a simple implementation like:

from pyxdameraulevenshtein import damerau_levenshtein_distance

keywords = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def correct_sentence(sentence):
    new_sentence = []
    for word in sentence.split():
        # Edit budget by word length: 0 for 1-2 chars, 1 for 3-5 chars, 2 otherwise.
        budget = 2
        n = len(word)
        if n < 3:
            budget = 0
        elif 3 <= n < 6:
            budget = 1
        if budget:
            # Replace the word with the first keyword within the budget.
            for keyword in keywords:
                if damerau_levenshtein_distance(word, keyword) <= budget:
                    new_sentence.append(keyword)
                    break
            else:
                # No keyword was close enough: keep the word as it is.
                new_sentence.append(word)
        else:
            # Very short words are never corrected.
            new_sentence.append(word)
    return " ".join(new_sentence)

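Just make sure you use a better tokenizer, otherwise this will get confused, but you get the idea. Also note that this is unoptimized and will be very slow with many keywords. You should implement some kind of bucketing so that you do not match every word against every keyword.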
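One possible way to approach the tokenizer and bucketing advice is sketched below; it is an assumption about what such bucketing could look like, not the answer's own method. Keywords are grouped by length, and since an edit distance can never be smaller than the difference in length between two strings, only keywords whose length is within the budget need to be compared. The names `tokenize`, `build_buckets`, and `candidates` are hypothetical, and the regex tokenizer deliberately drops digits and punctuation.

```python
import re
from collections import defaultdict

KEYWORDS = ['tiger', 'lion', 'elephant', 'black cat', 'dog']


def tokenize(sentence):
    """Lowercase the sentence and keep only runs of ASCII letters."""
    return re.findall(r"[a-z]+", sentence.lower())


def build_buckets(keywords):
    """Group keywords by length so lookups can skip impossible lengths."""
    buckets = defaultdict(list)
    for keyword in keywords:
        buckets[len(keyword)].append(keyword)
    return buckets


def candidates(word, buckets, budget):
    """Return keywords whose length differs from the word's by at most budget."""
    found = []
    for length in range(len(word) - budget, len(word) + budget + 1):
        found.extend(buckets.get(length, []))
    return found


buckets = build_buckets(KEYWORDS)
print(tokenize("There are 2 lyons!"))   # ['there', 'are', 'lyons']
print(candidates("tyger", buckets, 1))  # ['lion', 'tiger']
```

Only the returned candidates would then be passed to the distance function, instead of the full keyword list. Multi-word keywords such as 'black cat' would still need n-grams of tokens, which this sketch does not cover.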

Regarding python - Autocorrecting words in a Python list, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/55921804/
