
machine-learning - How do I write a program to find whether certain words are similar?

Reposted. Author: 行者123. Updated: 2023-11-30 08:27:13

For example: "college", "schoolwork", and "academy" belong to one cluster, and the words "essay", "scholarships", and "money" belong to another. Is this an ML problem or an NLP problem?

Best answer

It depends on how strict your definition of "similar" is.

Machine learning techniques

As others have already pointed out, you can use something like latent semantic analysis or the related latent Dirichlet allocation.

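Semantic similarity and WordNet

As was also pointed out, you may want to use an existing resource for something like this.

Many research papers use the term semantic similarity. It is usually computed by finding the distance between two words on a graph in which a word is a child of another word if it is a type of that parent. Example: "songbird" would be a child of "bird". If you wish, semantic similarity can be used as a distance metric for creating clusters.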
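To make the graph idea concrete, here is a minimal sketch that does not use WordNet at all. The `PARENT` table is a made-up toy taxonomy, and the helper names (`ancestors`, `path_distance`, `path_similarity`, `cluster`) are hypothetical, chosen only for illustration. Similarity is taken as 1 / (1 + number of edges on the shortest path), and words are grouped by simple single-link thresholding, which is only one of many possible clustering choices.

```python
# Hypothetical toy "is-a" taxonomy (child -> parent); this is NOT WordNet data.
PARENT = {
    "songbird": "bird",
    "parrot": "bird",
    "bird": "animal",
    "dog": "animal",
    "animal": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "entity",
}


def ancestors(word):
    """Return [word, parent, grandparent, ...] up to the root."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_distance(a, b):
    """Edges on the shortest path between a and b via their lowest common ancestor."""
    steps_up_from_a = {node: steps for steps, node in enumerate(ancestors(a))}
    for steps_b, node in enumerate(ancestors(b)):
        if node in steps_up_from_a:
            return steps_up_from_a[node] + steps_b
    return None  # the two words share no ancestor


def path_similarity(a, b):
    """1 / (1 + path distance); None when the words are not connected."""
    distance = path_distance(a, b)
    return None if distance is None else 1.0 / (1.0 + distance)


def cluster(words, threshold):
    """Single-link grouping: a word joins every cluster holding a similar-enough member."""
    clusters = []
    for word in words:
        linked = [c for c in clusters
                  if any((path_similarity(word, m) or 0) >= threshold for m in c)]
        merged = [word]
        for c in linked:
            merged.extend(c)
            clusters.remove(c)
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    print(path_similarity("songbird", "bird"))
    print(cluster(["songbird", "parrot", "bird", "car", "truck"], threshold=0.3))
```

With a real resource such as WordNet the taxonomy is far larger and a word can have several senses, so the threshold has to be tuned for the application.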

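Example implementation

In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a Gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully this points you in the right direction and gives you a few more search terms.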

def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn
    results = []
    # Compare every sense of word1 against every sense of word2.
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except Exception:
                # LCH is undefined for some pairs (e.g. different parts of speech).
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch is not None and lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            # definition() is a method in NLTK 3; older releases exposed an attribute.
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True

Example output

>>> sim('college', 'academy')
True

>>> sim('essay', 'schoolwork')
False

>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True

>>> sim('human', 'man')
True

>>> sim('human', 'car')
False

>>> sim('fare', 'food')
True

>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True

>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True

>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
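The scores in the output above are easier to interpret with the Leacock-Chodorow formula in hand: LCH similarity is -log(p / (2 * d)), where p is the shortest path length counted in nodes (edges + 1) and d is the depth of the taxonomy. The sketch below is plain arithmetic with no NLTK calls. The function names are hypothetical, and the depths 19 (nouns) and 13 (verbs) are assumptions inferred from the printed numbers, not values stated in the answer; other WordNet or NLTK versions may use different depths.

```python
import math

NOUN_DEPTH = 19  # assumption: inferred from the noun scores printed above
VERB_DEPTH = 13  # assumption: inferred from the verb scores printed above


def path_similarity(edges):
    """Path similarity for two senses that are `edges` steps apart."""
    return 1.0 / (edges + 1)


def lch_similarity(edges, taxonomy_depth):
    """Leacock-Chodorow similarity: -log(p / (2 * d)) with p = edges + 1."""
    return -math.log((edges + 1) / (2.0 * taxonomy_depth))


def max_edges_for_threshold(threshold, taxonomy_depth):
    """Largest number of edges whose LCH score still reaches `threshold`."""
    return math.floor(2 * taxonomy_depth * math.exp(-threshold)) - 1


if __name__ == "__main__":
    # fare / food: path similarity 0.5 means the senses are 1 edge apart
    print(path_similarity(1), lch_similarity(1, NOUN_DEPTH))
    # bird / songbird: path similarity 0.25 means 3 edges apart
    print(path_similarity(3), lch_similarity(3, NOUN_DEPTH))
    # happen / cause: path similarity 0.333... means 2 edges apart
    print(path_similarity(2), lch_similarity(2, VERB_DEPTH))
    # How far apart may two senses be under the default threshold of 2.15?
    print(max_edges_for_threshold(2.15, NOUN_DEPTH))
    print(max_edges_for_threshold(2.15, VERB_DEPTH))
```

Under these assumed depths, the default threshold of 2.15 accepts noun senses up to 3 edges apart and verb senses up to 2 edges apart, which is why the 'happen' / 'cause' pair only just passes.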

For more on "machine-learning - How do I write a program to find whether certain words are similar?", see the similar question we found on Stack Overflow: https://stackoverflow.com/questions/14148986/
