
python - Latent Semantic Analysis (LSA) tutorial

Reposted. Author: 太空狗. Updated: 2023-10-30 01:34:30

I am trying to use the LSA tutorial from this link (edit: July 2017, dead link removed).

The tutorial code is as follows:

# The question does not show its imports; these NumPy imports are assumed.
from numpy import asarray, log, sum, zeros
from numpy.linalg import svd

titles = [doc1,doc2]
stopwords = ['and','edition','for','in','little','of','the','to']
ignorechars = ''',:'!'''

class LSA(object):
    def __init__(self, stopwords, ignorechars):
        # Note: the stopwords argument is ignored here. The file is read as one
        # string, so `w in self.stopwords` in parse() is a substring test.
        self.stopwords = open('stop words.txt', 'r').read()
        self.ignorechars = ignorechars  # stored but never used by parse()
        self.wdict = {}   # word -> list of document indices, one per occurrence
        self.dcount = 0   # number of documents parsed so far

    def parse(self, doc):
        words = doc.split()
        for w in words:
            w = w.lower()
            if w in self.stopwords:
                continue
            elif w in self.wdict:
                self.wdict[w].append(self.dcount)
            else:
                self.wdict[w] = [self.dcount]
        self.dcount += 1

    def build(self):
        # Keep only words that occur more than once, then fill the
        # word-by-document count matrix A.
        self.keys = [k for k in self.wdict.keys() if len(self.wdict[k]) > 1]
        self.keys.sort()
        self.A = zeros([len(self.keys), self.dcount])
        for i, k in enumerate(self.keys):
            for d in self.wdict[k]:
                self.A[i,d] += 1

    def calc(self):
        self.U, self.S, self.Vt = svd(self.A)

    def TFIDF(self):
        WordsPerDoc = sum(self.A, axis=0)
        DocsPerWord = sum(asarray(self.A > 0, 'i'), axis=1)
        rows, cols = self.A.shape
        for i in range(rows):
            for j in range(cols):
                self.A[i,j] = (self.A[i,j] / WordsPerDoc[j]) * log(float(cols) / DocsPerWord[i])

    def printA(self):
        print('Here is the count matrix')
        print(self.A)

    def printSVD(self):
        print('Here are the singular values')
        print(self.S)
        print('Here are the first 3 columns of the U matrix')
        print(-1*self.U[:, 0:3])
        print('Here are the first 3 rows of the Vt matrix')
        print(-1*self.Vt[0:3, :])

mylsa = LSA(stopwords, ignorechars)
for t in titles:
    mylsa.parse(t)
mylsa.build()
mylsa.printA()
mylsa.calc()
mylsa.printSVD()

I have read it over and over, but I cannot figure it out. When I run the code, the result is as follows:

Here are the singular values
[ 4.28485706e+01 3.36652135e-14]
Here are the first 3 columns of the U matrix
[[ 3.30049181e-02 -9.99311821e-01 7.14336493e-04]
[ 6.60098362e-02 1.43697129e-03 6.53394384e-02]
[ 6.60098362e-02 1.43697129e-03 -9.95952378e-01]
...,
[ 3.30049181e-02 7.18485644e-04 2.02381089e-03]
[ 9.90147543e-02 6.81929920e-03 6.35728804e-03]
[ 3.30049181e-02 7.18485644e-04 2.02381089e-03]]
Here are the first 3 rows of the Vt matrix
array([[ 0.5015178 , 0.86514732],
[-0.86514732, 0.5015178 ]])

How do I calculate the similarity of doc1 and doc2 from these matrices? With the tf-idf algorithm I wrote myself I got a simple float, but here I get 3 matrices. Any suggestions?

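Best answer

One option is to run cosine similarity between the two matrices. I think you will find useful related information in a question I posted earlier. I also posted an answer to that question, and I see that others gave good answers as well.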

Python: tf-idf-cosine: to find document similarity
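
As a rough illustration of the cosine-similarity suggestion, here is a minimal sketch that reuses the singular values and Vt matrix printed in the question. It assumes a common LSA convention that the answer itself does not spell out: document j is represented by column j of diag(S) * Vt. The names `cosine_similarity` and `doc_vectors` are made up for this example and are not part of the tutorial code.

```python
import numpy as np

# Values copied from the output printed in the question. The question prints
# -1 * Vt; flipping the sign of every entry does not change the cosine.
S = np.array([4.28485706e+01, 3.36652135e-14])
Vt = np.array([[0.5015178, 0.86514732],
               [-0.86514732, 0.5015178]])

def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Assumption: scale row i of Vt by singular value i, then read each
# document's coordinates from the columns (transposed to one row per doc).
doc_vectors = (S[:, np.newaxis] * Vt).T

similarity = cosine_similarity(doc_vectors[0], doc_vectors[1])
print(similarity)
```

With these particular numbers the second singular value is effectively zero, so both documents lie along a single latent dimension and the score comes out at approximately 1.0. With more documents and several non-negligible singular values, the same function would return a value between -1 and 1 for each pair.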

Regarding python - Latent Semantic Analysis (LSA) tutorial, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18439316/
