R: What is the correct way to compute cosine similarity?

Reposted by: 行者123 Updated: 2023-12-02 18:12:50

I am working in the R programming language.

I have the following data:

text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.", 
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.", 
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.", 
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.", 
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!", 
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))

I want to compute the matrix of cosine similarities between every pair of elements:

library(lsa)
library(proxy)
library(tm)

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, 
    control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm, 
    MARGIN = 1, 
    FUN = function(x) sum(x > 0) / ncol(tdm))

tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

# lsaMatrix now is a k x (num doc) matrix, in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

This is the resulting matrix:

 distMatrix
             1           2         3            4          5          6            7           8
1 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
2 6.362649e-03 0.000000000 0.1904180 6.362649e-03 0.11468650 0.33082042 5.505664e-03 0.019623883
3 2.616818e-01 0.190417963 0.0000000 2.616818e-01 0.55622109 0.89444938 2.563879e-01 0.322025370
4 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
5 6.794855e-02 0.114686503 0.5562211 6.794855e-02 0.00000000 0.06202843 7.083380e-02 0.040392530
6 2.513851e-01 0.330820421 0.8944494 2.513851e-01 0.06202843 0.00000000 2.566349e-01 0.197460291
7 3.107289e-05 0.005505664 0.2563879 3.107289e-05 0.07083380 0.25663492 0.000000e+00 0.004363538
8 3.658840e-03 0.019623883 0.3220254 3.658840e-03 0.04039253 0.19746029 4.363538e-03 0.000000000

My question: have I computed the cosine similarity correctly? Are there other ways to do this?

Thanks!

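Best answer

First, I would suggest reporting cosine rather than 1 - cosine, because similarities are easier to read than dissimilarities. Using your code to compute the cosine similarity: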

library(lsa)
library(proxy)
library(tm)

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus, 
                          control = list(wordLengths = c(1, Inf)))
occurrence <- apply(X = tdm, 
                    MARGIN = 1, 
                    FUN = function(x) sum(x > 0) / ncol(tdm))

tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# cosine() returns similarities, so the result is named simMatrix here
simMatrix <- cosine(lsaMatrix)
round(simMatrix, 3)

Output:

      1     2     3     4     5     6     7     8
1 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
2 0.994 1.000 0.810 0.994 0.885 0.669 0.994 0.980
3 0.738 0.810 1.000 0.738 0.444 0.106 0.744 0.678
4 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
5 0.932 0.885 0.444 0.932 1.000 0.938 0.929 0.960
6 0.749 0.669 0.106 0.749 0.938 1.000 0.743 0.803
7 1.000 0.994 0.744 1.000 0.929 0.743 1.000 0.996
8 0.996 0.980 0.678 0.996 0.960 0.803 0.996 1.000
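
When scanning a similarity matrix like this one, it can help to list each pair of documents once, ordered from most to least similar. The following is a minimal base R sketch of that idea; the matrix `sim` and its values are hypothetical and only stand in for a real cosine similarity matrix.

```r
# Hypothetical 3 x 3 cosine similarity matrix (illustrative values only)
sim <- matrix(c(1.0, 0.9, 0.2,
                0.9, 1.0, 0.4,
                0.2, 0.4, 1.0),
              nrow = 3, byrow = TRUE)

# A dissimilarity matrix is just 1 - similarity, and vice versa
dissim <- 1 - sim

# Take each unordered pair once (upper triangle), most similar first
idx <- which(upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(doc_a = unname(idx[, "row"]),
                    doc_b = unname(idx[, "col"]),
                    similarity = sim[idx])
pairs <- pairs[order(-pairs$similarity), ]
print(pairs)
```

Because the two forms differ only by the subtraction from 1, the ordering of pairs is the same whichever one is stored.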

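Your matrix looks fine. Values close to 1 mean the documents are similar, and values close to 0 mean they are dissimilar. As a sanity check, you can use the stringsim function from the stringdist package. Note that its "cosine" method compares character q-gram profiles of the two strings rather than LSA vectors, so it is an independent check and not the same calculation. Let us compare reviews 1 and 4 from the dataset with the following code: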

library(stringdist)
# text is a character vector after `text = text[,2]` above, so index it with one subscript
stringsim(text[1], text[4], method = "cosine")

Output:

[1] 1

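As you can see, the output is 1, which matches the value in your matrix. This is expected, because reviews 1 and 4 contain exactly the same text.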
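
As for other ways to do this: cosine similarity can also be computed directly in base R, without the lsa package, since it is just the dot product of two vectors divided by the product of their lengths. The sketch below is one possible way to write it; `cosine_sim_matrix` and `toy_tdm` are hypothetical names introduced for illustration. It works on raw term counts rather than on the LSA space, so applied to the data above it would generally give different values from the matrix shown earlier.

```r
# Cosine similarity between all pairs of columns of a numeric matrix.
# cosine_sim_matrix is a hypothetical helper, not part of any package.
cosine_sim_matrix <- function(m) {
  norms <- sqrt(colSums(m^2))          # Euclidean length of each column
  crossprod(m) / outer(norms, norms)   # dot products divided by length products
}

# Toy term-document matrix: rows are terms, columns are documents.
# matrix() fills column by column, so each line of values is one document.
toy_tdm <- matrix(c(2, 1, 0,
                    2, 1, 0,
                    0, 0, 3,
                    1, 0, 1),
                  nrow = 3,
                  dimnames = list(c("food", "card", "chicken"),
                                  c("d1", "d2", "d3", "d4")))

sim <- cosine_sim_matrix(toy_tdm)
print(round(sim, 3))
```

In this toy example d1 and d2 have identical counts and get a similarity of 1, while d1 and d3 share no terms and get 0. A document with no terms at all would have length zero and produce NaN, so such columns should be dropped first.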

The original question, "R: What is the correct way to compute cosine similarity?", can be found on Stack Overflow: https://stackoverflow.com/questions/72037888/
