
python - Scoring sentences based on word scores

Reposted. Author: 太空宇宙. Updated: 2023-11-03 20:26:47

I have a large forum about dogs with tagged posts. An index score of document frequency * term frequency lets me measure a topic's content perfectly. For example:

print (getscores('dog food'))
# keyword scores range between 1 and 2
# {'dog':2,'food':1.8,'bowl':1.7,'consumption':1.5, ..... 'like':1.00001}

From there it seemed easy to score sentences and find the one that best represents the topic, or so I thought. In this example, the second sentence is the best fit.

def method1(sen):
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score

def method2(sen):
    score = 1
    for word in sen.split():
        score = score * scores.get(word, 1)
    return score / len(sen.split())

scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}
sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']

for sen in sens:
    print(sen)
    print(method1(sen))
    print(method2(sen))

#dog food
#3.6
#1.8 (winner method 2)
#dog food is food intended for consumption by dogs
#13.607999999999999
#1.5119999999999998
#like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty
#22.032220320000004 (winner method 1)
#0.7868650114285716

Averaging the scores favors short sentences, while multiplying them all together favors long ones. Compensating for sentence length (multiplying in a factor of about 0.92 per word) works for one topic, but the next topic needs a different factor.

So this approach is getting me nowhere. Is there a known sentence-scoring method that gives me the sentence with the highest keyword weight while also taking keyword density and sentence length into account?
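One way to interpolate between the two methods above (a sketch, not from the original post) is to damp the product of word scores by a tunable power of sentence length, so a single bounded knob replaces the per-word 0.92-style factor. The `alpha=0.3` default is only an illustrative middle value:

```python
import math

# keyword scores copied from the question
scores = {'dog': 2, 'food': 1.8, 'bowl': 1.7, 'consumption': 1.5, 'intended': 1.4}

def score_sentence(sen, alpha=0.3):
    """Product of word scores, damped by len(words) ** alpha.

    alpha=0 reproduces method1 (raw product, favors long sentences);
    alpha=1 is the geometric mean (favors short, dense sentences);
    values in between trade keyword density against sentence length.
    """
    words = sen.split()
    log_sum = sum(math.log(scores.get(word, 1)) for word in words)
    return math.exp(log_sum / len(words) ** alpha)
```

With alpha around 0.3, the middle example sentence ('dog food is food intended for consumption by dogs') scores highest of the three; each topic may still need its own alpha, but it is one bounded parameter rather than a hand-tuned per-word multiplier.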

Best Answer

Your results may improve if you use Multi-word expressions (MWE) in your processing pipeline. This preprocessing is normally done before the TF-IDF step. The code below illustrates how to use them:

from nltk.tokenize import MWETokenizer

# Instantiate the tokenizer with a list of MWEs:
tokenizer = MWETokenizer([('dog', 'food'), ('band', 'camp')])

tl1 = tokenizer.tokenize('dog food is food intended for consumption by dogs'.split())
print(tl1)
tl2 = tokenizer.tokenize('like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty'.split())
print(tl2)

#['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']
#['like', 'this', 'one', 'time', 'at', 'band_camp', 'there', 'was', 'all', 'this', 'food', 'and', 'and', 'a', 'dog', 'this', 'dog', 'who', 'ate', 'all', 'the', 'food', 'and', 'then', 'my', 'bowl', 'was', 'empty']
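With the MWEs merged, the multiplicative scoring from the question can run over token lists directly. In the sketch below, the `'dog_food'` score of 2.5 is a hypothetical value for the merged unit, not a number from the original index:

```python
# 'dog_food' is a hypothetical score for the merged MWE token;
# the remaining entries are copied from the question
scores = {'dog_food': 2.5, 'dog': 2, 'food': 1.8, 'bowl': 1.7,
          'consumption': 1.5, 'intended': 1.4}

def score_tokens(tokens):
    """Same multiplicative scoring as method1, over pre-tokenized input."""
    score = 1
    for token in tokens:
        score *= scores.get(token, 1)
    return score

merged = ['dog_food', 'is', 'food', 'intended', 'for', 'consumption', 'by', 'dogs']
```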

spaCy's dependency parser and part-of-speech tagger are very useful for extracting such MWEs.

The example below detects compound nouns that are likely MWEs:

import spacy
nlp = spacy.load('en_core_web_sm')

sens = ['dog food',
        'dog food is food intended for consumption by dogs',
        'like this one time at band camp there was all this food and and a dog this dog who ate all the food and then my bowl was empty']

def getCompoundNouns(sentence):
    doc = nlp(sentence)
    answer = []
    for t in doc:
        # a noun in a 'compound' dependency whose right neighbor is also a noun
        if t.dep_ == 'compound' and t.pos_ == 'NOUN':
            neighboringToken = t.nbor()
            if neighboringToken.pos_ == 'NOUN':
                answer.append((t.text, neighboringToken.text))
    if not answer:
        return None
    return answer

for s in sens:
    print(getCompoundNouns(s))

#[('dog', 'food')]
#[('dog', 'food')]
#[('band', 'camp')]
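The detected pairs can be fed straight back into the `MWETokenizer` from the first snippet. A small glue function (a sketch, assuming `getCompoundNouns` returns pairs of token texts) collects them into the tuple list the tokenizer expects:

```python
def compounds_to_mwes(results):
    """Collect unique (left, right) pairs into a sorted tuple list for MWETokenizer."""
    mwes = set()
    for pairs in results:
        for left, right in pairs or []:   # getCompoundNouns may return None
            mwes.add((left, right))
    return sorted(mwes)

# shape of the getCompoundNouns output over the three example sentences
results = [[('dog', 'food')], [('dog', 'food')], [('band', 'camp')]]
```

The resulting list can then be passed as `MWETokenizer(compounds_to_mwes(results))`, closing the loop between detection and tokenization.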

Regarding python - scoring sentences based on word scores, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57759551/
