python - How to find the similarity between two sentences using gensim.similarities.Similarity


Reposted by: 行者123. Last updated: 2023-12-02 01:00:02

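I wanted to write code to find the similarity between two sentences, and I ended up writing the code below using nltk and gensim. I use tokenization and gensim.similarities.Similarity to do the work, but it does not do what I need. Everything works fine until I add the last line of code.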

import gensim
import nltk

raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]
from nltk.tokenize import word_tokenize
gen_docs = [[w.lower() for w in word_tokenize(text)]
        for text in raw_documents]



dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary[5])
print(dictionary.token2id['socks'])
print("Number of words in dictionary:",len(dictionary))
for i in range(len(dictionary)):
    print(i, dictionary[i])

corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
print(corpus)

tf_idf = gensim.models.TfidfModel(corpus)
print(tf_idf)
s = 0
for i in corpus:
    s += len(i)
print(s)

sims = gensim.similarities.Similarity('/usr/workdir/',tf_idf[corpus],
                                  num_features=len(dictionary))
print(sims)
print(type(sims))


query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

sims[query_doc_tf_idf]

It throws the error below. I could not find an answer to this problem anywhere on the internet.

Traceback (most recent call last):
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 679, in save
    _pickle.dump(self, fname_or_handle, protocol=pickle_protocol)
TypeError: file must have a 'write' attribute

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "semantic.py", line 45, in <module>
    sims[query_doc_tf_idf]
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 503, in __getitem__
    self.close_shard()  # no-op if no documents added to index since last query
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 427, in close_shard
    shard = Shard(self.shardid2filename(shardid), index)
  File "C:\Python36\lib\site-packages\gensim\similarities\docsim.py", line 110, in __init__
    index.save(self.fullname())
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 682, in save
    self._smart_save(fname_or_handle, separately, sep_limit, ignore, pickle_protocol=pickle_protocol)
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 538, in _smart_save
    pickle(self, fname, protocol=pickle_protocol)
  File "C:\Python36\lib\site-packages\gensim\utils.py", line 1337, in pickle
    with smart_open(fname, 'wb') as fout:  # 'b' for binary, needed on Windows
  File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 181, in smart_open
    fobj = _shortcut_open(uri, mode, **kw)
  File "C:\Python36\lib\site-packages\smart_open\smart_open_lib.py", line 287, in _shortcut_open
    return io.open(parsed_uri.uri_path, mode, **open_kwargs)

Please help me find out what the problem is.

Best answer

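Your query should work if you specify a valid path when instantiating Similarity. For the example below, I created a directory named Similarity on my C drive and passed the directory path together with a file name in the function call.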
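The directory can also be created from Python before the index is built. In gensim, the first argument of Similarity (`output_prefix`) is used as a file-name prefix for the index shards written to disk, so it should point into a directory that exists and is writable. The following is only a minimal sketch using the standard library: the helper name `prepare_output_prefix` and its `name` parameter are hypothetical, and a temporary directory stands in for a real location such as C:/Similarity.

```python
import os
import tempfile


def prepare_output_prefix(directory, name="sims"):
    """Create `directory` if it is missing and return a file-name prefix inside it.

    Hypothetical helper for illustration; it is not part of gensim.
    """
    os.makedirs(directory, exist_ok=True)  # does nothing if the directory exists
    return os.path.join(directory, name)


if __name__ == "__main__":
    # Assumption: a temporary directory replaces a real path like C:/Similarity.
    base = tempfile.mkdtemp()
    prefix = prepare_output_prefix(os.path.join(base, "Similarity"))
    print(prefix)
    # The prefix could then be passed as the first argument, for example:
    # sims = gensim.similarities.Similarity(prefix, tf_idf[corpus],
    #                                       num_features=len(dictionary))
```

Returning a prefix that ends in a file name, rather than in a path separator, mirrors the 'C:/Similarity/sims' form used in this answer.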

sims = gensim.similarities.Similarity('C:/Similarity/sims',tf_idf[corpus],
                                  num_features=len(dictionary))
print(sims)
print(type(sims))

query_doc = [w.lower() for w in word_tokenize("Socks are a force for good.")]
print(query_doc)
query_doc_bow = dictionary.doc2bow(query_doc)
print(query_doc_bow)
query_doc_tf_idf = tf_idf[query_doc_bow]
print(query_doc_tf_idf)

print('Query result:', sims[query_doc_tf_idf])

Query result: [0.       0.84565616   0.      0.06124881   0.        ]
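The result holds one score per indexed document, in the same order as raw_documents, so the best match can be read off by pairing the two. A minimal sketch, assuming the scores printed above are copied into a plain list; the helper name `rank_documents` is hypothetical and not part of gensim.

```python
raw_documents = ["I'm taking the show on the road.",
                 "My socks are a force multiplier.",
                 "I am the barber who cuts everyone's hair who doesn't cut their own.",
                 "Legend has it that the mind is a mad monkey.",
                 "I make my own fun."]

# Scores copied from the query result shown above.
scores = [0.0, 0.84565616, 0.0, 0.06124881, 0.0]


def rank_documents(documents, scores):
    """Return (score, document) pairs sorted from most to least similar."""
    return sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)


ranked = rank_documents(raw_documents, scores)
for score, doc in ranked:
    print(f"{score:.4f}  {doc}")
```

With these scores the top entry is "My socks are a force multiplier.", which shares the most terms with the query "Socks are a force for good."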

Regarding "python - How to find the similarity between two sentences using gensim.similarities.Similarity", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51287590/
