
python - Non-meaningful spaCy nouns

Reposted. Author: 行者123. Last updated: 2023-12-04 02:27:10

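I am using spaCy to extract nouns from sentences. The sentences are grammatically poor and may also contain spelling mistakes.
This is the code I am using:
Code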

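import re

import spacy

nlp = spacy.load("en_core_web_sm")

sentence = "HANDBRAKE - slow and fast (SFX)"
string = sentence.lower()
# Replace runs of non-word characters with a single space
# (raw string so that \W is not treated as a string escape)
cleanString = re.sub(r'\W+', ' ', string)
# \W does not match underscores, so replace them separately
cleanString = cleanString.replace("_", " ")

doc = nlp(cleanString)

# Print every token that spaCy tags as a noun
for token in doc:
    if token.pos_ == "NOUN":
        print(token.text)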

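Output:
sfx
Similarly, for the sentence "fast foward2", the noun I get from spaCy is
foward2
This shows that the extracted nouns include some meaningless words, for example: sfx, foward2, ms, 64x, bit, pwm, r, brailledisplayfastmovement, etc.
I only want to keep phrases that contain sensible single-word nouns such as broom, ticker, pool, highway, etc.
I tried using WordNet to keep only the nouns common to WordNet and spaCy, but it is a bit too strict and also filters out some sensible nouns. For example, it filters out nouns like motorbike, whoosh, trolley, metal, suitcase, zipper, etc.
So I am looking for a solution that lets me keep only the most sensible nouns from the list of nouns spaCy gives me.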

Best answer

It looks like you can use the pyenchant library:

Enchant is used to check the spelling of words and suggest corrections for words that are miss-spelled. It can use many popular spellchecking packages to perform this task, including ispell, aspell and MySpell. It is quite flexible at handling multiple dictionaries and multiple languages.

More information is available on the Enchant website:

https://abiword.github.io/enchant/


Sample Python code:
import re

import enchant  # pip install pyenchant
import spacy

d = enchant.Dict("en_US")
nlp = spacy.load("en_core_web_sm")

sentence = "For example, it filters nouns like motorbike, whoosh, trolley, metal, suitcase, zip etc"
cleanString = re.sub(r'[\W_]+', ' ', sentence.lower())  # Merging \W and _ into one regex

doc = nlp(cleanString)
# Keep a token only if spaCy tags it as a noun AND the dictionary knows the word
for token in doc:
    if token.pos_ == "NOUN" and d.check(token.text):
        print(token.text)
# => [example, nouns, motorbike, whoosh, trolley, metal, suitcase, zip]
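The dictionary check above can be combined with a cheap shape check, so that tokens such as "r", "64x" or "foward2" are rejected before the dictionary is consulted at all. The following is only a minimal sketch of that idea, not part of the original answer: the function name `filter_plausible_nouns`, the `min_len` parameter and the tiny `DEMO_VOCAB` set are hypothetical and exist only to keep the example self-contained. In real use you would presumably pass `d.check` from the code above as the `is_word` argument.

```python
import re
from typing import Callable, Iterable, List

# Matches tokens made up of lowercase ASCII letters only.
_ALPHA_ONLY = re.compile(r"[a-z]+")


def filter_plausible_nouns(
    nouns: Iterable[str],
    is_word: Callable[[str], bool],
    min_len: int = 3,  # hypothetical threshold; tune for your data
) -> List[str]:
    """Keep nouns that look like words and pass a dictionary lookup."""
    kept = []
    for noun in nouns:
        word = noun.lower()
        # Shape check: drop very short tokens and anything with digits/symbols.
        if len(word) < min_len or not _ALPHA_ONLY.fullmatch(word):
            continue
        # Dictionary check: `is_word` is any callable returning True/False,
        # e.g. enchant's `d.check` (assumed here, replaced by a set below).
        if is_word(word):
            kept.append(word)
    return kept


# Stand-in vocabulary (an assumption for this demo, not a real dictionary).
DEMO_VOCAB = {"broom", "ticker", "pool", "highway", "motorbike", "bit"}

candidates = ["sfx", "foward2", "ms", "64x", "bit", "pwm", "r", "broom", "highway"]
result = filter_plausible_nouns(candidates, DEMO_VOCAB.__contains__)
print(result)  # ['bit', 'broom', 'highway']
```

Note that a real dictionary will still accept valid but uninformative words such as "bit", so a domain-specific stop list may be needed on top of this.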

Regarding this topic, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/66751457/
