
R: What is the correct way to compute cosine similarity?

Reposted. Author: 行者123. Updated: 2023-12-02 01:37:57

I am working with the R programming language.

I have the following data:

text = structure(list(id = 1:8, reviews = c("I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.", 
"I went to McDonald's and they charge me 50 for Big Mac when I only came with 49. The casher told me that I can't read correctly and told me to get glasses. I am file a report on your casher and now I'm mad.",
"I really think that if you can buy breakfast anytime then I should be able to get a cheeseburger anytime especially since I really don't care for breakfast food. I really like McDonald's food but I preferred tree lunch rather than breakfast. Thank you thank you thank you.",
"I guess the employee decided to buy their lunch with my card my card hoping I wouldn't notice but since it took so long to run my car I want to head and check my bank account and sure enough they had bought food on my card that I did not receive leave. Had to demand for and for a refund because they acted like it was my fault and told me the charges are still pending even though they are for 2 different amounts.",
"Never order McDonald's from Uber or Skip or any delivery service for that matter, most particularly one on Elgin Street and Rideau Street, they never get the order right. Workers at either of these locations don't know how to follow simple instructions. Don't waste your money at these two locations.",
"Employees left me out in the snow and wouldn’t answer the drive through. They locked the doors and it was freezing. I asked the employee a simple question and they were so stupid they answered a completely different question. Dumb employees and bad food.",
"McDonalds food was always so good but ever since they add new/more crispy chicken sandwiches it has come out bad. At first I thought oh they must haven't had a good day but every time I go there now it's always soggy, and has no flavor. They need to fix this!!!",
"I just ordered the new crispy chicken sandwich and I'm very disappointed. Not only did it taste horrible, but it was more bun than chicken. Not at all like the commercial shows. I hate sweet pickles and there were two slices on my sandwich. I wish I could add a photo to show the huge bun and tiny chicken."
)), class = "data.frame", row.names = c(NA, -8L))

I want to compute the cosine similarity matrix between every pair of elements:

library(lsa)
library(proxy)
library(tm)

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
# Fraction of documents in which each term occurs
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))

# Keep only the terms that occur in at least half of the documents
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

# lsaMatrix now is a k x (num doc) matrix, in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

Looking at the resulting matrix:

distMatrix
             1           2         3            4          5          6            7           8
1 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
2 6.362649e-03 0.000000000 0.1904180 6.362649e-03 0.11468650 0.33082042 5.505664e-03 0.019623883
3 2.616818e-01 0.190417963 0.0000000 2.616818e-01 0.55622109 0.89444938 2.563879e-01 0.322025370
4 0.000000e+00 0.006362649 0.2616818 0.000000e+00 0.06794855 0.25138506 3.107289e-05 0.003658840
5 6.794855e-02 0.114686503 0.5562211 6.794855e-02 0.00000000 0.06202843 7.083380e-02 0.040392530
6 2.513851e-01 0.330820421 0.8944494 2.513851e-01 0.06202843 0.00000000 2.566349e-01 0.197460291
7 3.107289e-05 0.005505664 0.2563879 3.107289e-05 0.07083380 0.25663492 0.000000e+00 0.004363538
8 3.658840e-03 0.019623883 0.3220254 3.658840e-03 0.04039253 0.19746029 4.363538e-03 0.000000000

My question: am I computing the cosine similarity correctly? Are there other ways to do this?

Thanks!


Best answer

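First, I would suggest reporting cosine rather than 1 - cosine, because a similarity is easier to read than a dissimilarity. Using your code to compute the cosine similarity: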

library(lsa)
library(proxy)
library(tm)

text = text[,2]

corpus <- VCorpus(VectorSource(text))
tdm <- TermDocumentMatrix(corpus,
                          control = list(wordLengths = c(1, Inf)))
# Fraction of documents in which each term occurs
occurrence <- apply(X = tdm,
                    MARGIN = 1,
                    FUN = function(x) sum(x > 0) / ncol(tdm))

# Keep only the terms that occur in at least half of the documents
tdm_mat <- as.matrix(tdm[names(occurrence)[occurrence >= 0.5], ])

lsaSpace <- lsa(tdm_mat)

lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# This is a similarity matrix (no 1 - ...), so it is named accordingly
simMatrix <- cosine(lsaMatrix)
round(simMatrix, 3)

Output:

      1     2     3     4     5     6     7     8
1 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
2 0.994 1.000 0.810 0.994 0.885 0.669 0.994 0.980
3 0.738 0.810 1.000 0.738 0.444 0.106 0.744 0.678
4 1.000 0.994 0.738 1.000 0.932 0.749 1.000 0.996
5 0.932 0.885 0.444 0.932 1.000 0.938 0.929 0.960
6 0.749 0.669 0.106 0.749 0.938 1.000 0.743 0.803
7 1.000 0.994 0.744 1.000 0.929 0.743 1.000 0.996
8 0.996 0.980 0.678 0.996 0.960 0.803 0.996 1.000
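
For reference, when `cosine()` from lsa is given a matrix, it compares the columns pairwise, which is why each document has to be a column of `lsaMatrix`. The sketch below reproduces that calculation in base R on a small made-up matrix, so the numbers can be checked by hand. The names `cosine_columns` and `toy` are invented for this illustration and are not part of any package.

```r
# Hypothetical helper: cosine similarity between every pair of columns of m.
cosine_columns <- function(m) {
  dot_products <- crossprod(m)          # t(m) %*% m: all pairwise dot products
  norms <- sqrt(diag(dot_products))     # Euclidean norm of each column
  dot_products / outer(norms, norms)    # divide each entry by the product of norms
}

# Toy matrix: 3 "terms" (rows) by 3 "documents" (columns)
toy <- cbind(d1 = c(1, 2, 0),
             d2 = c(2, 4, 0),   # points in the same direction as d1
             d3 = c(0, 0, 3))   # shares no terms with d1 or d2

sim <- cosine_columns(toy)
round(sim, 3)       # similarity: d1 vs d2 is (approximately) 1, d1 vs d3 is 0
round(1 - sim, 3)   # dissimilarity, the form used in the question
```

The question's code applies the same two steps to the LSA matrix: `cosine()` for similarity and `1 - cosine()` for dissimilarity.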

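Your matrix looks fine. Values close to 1 mean the documents are similar, and values close to 0 mean they are dissimilar. To sanity-check a similarity, you can use the stringsim function from the stringdist package. Let's compare reviews 1 and 4 in the dataset with the following code: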

library(stringdist)
# text is a character vector after `text = text[,2]`, so use a single index
stringsim(text[1], text[4], method = "cosine")

Output:

[1] 1

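As you can see, the output is 1, the same as in your matrix. This is expected, because reviews 1 and 4 are identical texts.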
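
One caveat about this check: as far as the stringdist documentation describes it, the "cosine" method compares character q-gram profiles (single characters by default), not term vectors in LSA space. It confirms the identical pair, but the two measures need not agree for other pairs. The base-R sketch below illustrates the idea with single-character counts. `char_cosine` is a hypothetical helper written for this example, not a stringdist function, and it is only assumed to mirror what stringsim does with its default settings.

```r
# Hypothetical helper: cosine similarity between the character-count
# profiles of two strings (an approximation of the q = 1 case).
char_cosine <- function(a, b) {
  chars_a <- strsplit(a, "")[[1]]
  chars_b <- strsplit(b, "")[[1]]
  alphabet <- union(chars_a, chars_b)
  # Count how often each character of the shared alphabet occurs in each string
  count_a <- as.numeric(table(factor(chars_a, levels = alphabet)))
  count_b <- as.numeric(table(factor(chars_b, levels = alphabet)))
  sum(count_a * count_b) / (sqrt(sum(count_a^2)) * sqrt(sum(count_b^2)))
}

char_cosine("listen", "silent")  # anagrams: same character counts, different words
char_cosine("abc", "xyz")        # no characters in common
```

Because anagrams get a similarity of (approximately) 1 here, a character-level score is best treated as a rough cross-check for the document-level matrix and not as a replacement for it.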

Regarding "R: What is the correct way to compute cosine similarity?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/72037888/
