python - PyEnchant 'correcting' words in the dictionary to words not in the dictionary

Reposted. Author: 行者123. Updated: 2023-12-01 05:26:57

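I am trying to take large amounts of natural language from a web forum and correct the spelling with PyEnchant. The text is often informal and about medical issues, so I have created a text file "test.pwl" containing relevant medical words, chat abbreviations, and so on. In some cases, small bits of html, urls, etc. unfortunately remain in the text.

My script is designed to use both the en_US dictionary and the PWL to find all misspelled words and correct them, fully automatically, to the first suggestion returned by d.suggest. It prints a list of the misspelled words, then a list of the words that had no suggestions, and writes the corrected text to "spellfixed.txt":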

import enchant
import codecs

def spellcheckfile(filepath):
    d = enchant.DictWithPWL("en_US","test.pwl")
    try:
        f = codecs.open(filepath, "r", "utf-8")
    except IOError:
        print "Error reading the file, right filepath?"
        return
    textdata = f.read()
    mispelled = []
    words = textdata.split()
    for word in words:
        # if spell check failed and the word is also not in
        # mis-spelled list already, then add the word
        if d.check(word) == False and word not in mispelled:
            mispelled.append(word)
    print mispelled
    for mspellword in mispelled:
        #get suggestions
        suggestions=d.suggest(mspellword)
        #make sure we actually got some
        if len(suggestions) > 0:
            # pick the first one
            picksuggestion=suggestions[0]
        else: print mspellword
        #replace every occurrence of the bad word with the suggestion
        #this is almost certainly a bad idea :)
        textdata = textdata.replace(mspellword,picksuggestion)
    try:
        fo=open("spellfixed.txt","w")
    except IOError:
        print "Error writing spellfixed.txt to current directory. Who knows why."
        return
    fo.write(textdata.encode("UTF-8"))
    fo.close()
    return

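The problem is that the output often contains "corrections" of words that were in either the dictionary or the PWL. For instance, when the first portion of the input was: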

My NEW doctor feels that I am now bi-polar . This , after 9 years of being considered majorly depressed by everyone else

I got this:

My NEW dotor feels that I am now bipolar . This , aftER 9 years of being considERed majorly depressed by evERyone else

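I could handle the case changes, but doctor --> dotor is no good at all. When the input is much shorter (for example, when the quotation above is the entire input), the result is what I want: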

My NEW doctor feels that I am now bipolar . This , after 9 years of being considered majorly depressed by everyone else

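Could anybody explain to me why? In very simple terms, please, as I am very new to programming and newer still to Python. A step-by-step solution would be greatly appreciated.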

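Best answer

I think your problem is that you are replacing letter sequences inside words. "ER" might be a valid spelling correction for "er", but that does not mean you should change "considered" to "considERed".

You can use regular expressions instead of simple text replacement to make sure that only complete words are replaced. "\b" in a regular expression means "word boundary":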

>>> "considered at the er".replace( "er", "ER" )
'considERed at the ER'
>>> import re
>>> re.sub( "\\b" + "er" + "\\b", "ER", "considered at the er" )
'considered at the ER'
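
The same idea can be applied to a whole set of corrections at once. The following is only a minimal sketch, not the asker's script: the function name replace_whole_words and the corrections dictionary are hypothetical, with the dictionary standing in for the first suggestions that d.suggest would return. It uses re.escape so that any punctuation inside a token is treated literally rather than as regex syntax.

```python
import re

def replace_whole_words(text, corrections):
    """Replace each key of `corrections` with its value, whole words only."""
    for bad, good in corrections.items():
        # re.escape stops characters such as "." or "+" in the token
        # from being interpreted as regular-expression syntax.
        pattern = r"\b" + re.escape(bad) + r"\b"
        # A function replacement inserts `good` literally, with no
        # backslash processing of the replacement text.
        text = re.sub(pattern, lambda match, good=good: good, text)
    return text

# Hypothetical corrections, standing in for the first d.suggest() result.
corrections = {"er": "ER", "bi-polar": "bipolar"}
sample = "I am now bi-polar . This , after 9 years at the er"
print(replace_whole_words(sample, corrections))
# prints: I am now bipolar . This , after 9 years at the ER
```

This also suggests why a short input can look fine while a long one does not: a longer text is likely to contain more short tokens that fail the spell check, and with plain str.replace each of them is rewritten everywhere it occurs, including inside correctly spelled words. One caveat of the "\b" approach is that a token beginning or ending with punctuation may not match at all, because there is no word boundary between a space and a punctuation character; such tokens are simply left unchanged.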

Regarding "python - PyEnchant 'correcting' words in the dictionary to words not in the dictionary", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/21161853/
