
Removing duplicates from a list based on semantic similarity/relatedness

Reposted. Author: 行者123. Updated: 2023-12-04 09:11:19

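R + tm: how can I remove duplicates from a list based on semantic similarity?
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv"). My expected solution is c("bank", "ford_suv", "toyota_suv", "nissan_suv"). That is, bank, banks and banking are reduced to the single term "bank". SnowBall::stemming is not an option, because I have to preserve the flavor of the newspaper styles of different countries. Any help or guidance would be useful.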

Best answer

We can use adist to compute the Levenshtein distance between the words and hclust to regroup them into clusters:

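v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")
d <- adist(v)      # pairwise Levenshtein (edit) distances
rownames(d) <- v   # label the rows with the terms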

This gives the distance matrix between the terms:
#                [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
#bank               0     1     3     8     9     8     2    13     6     5     3     4
#banks              1     0     3     7     9     7     2    13     6     6     2     5
#banking            3     3     0     8    10     8     3    13     7     6     3     7
#ford_suv           8     7     8     0     5     6     8    12     7     7     8     4
#toyota_suv         9     9    10     5     0     6     9     7     4     9     9     9
#nissan_suv         8     7     8     6     6     0     8    13    10     4     8    10
#banker             2     2     3     8     9     8     0    12     6     6     1     6
#toyota_corolla    13    13    13    12     7    13    12     0     8    13    12    12
#toyota             6     6     7     7     4    10     6     8     0     6     7     5
#nissan             5     6     6     7     9     4     6    13     6     0     7     6
#bankers            3     2     3     8     9     8     1    12     7     7     0     6
#ford               4     5     7     4     9    10     6    12     5     6     6     0
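To make the entries of that matrix easier to read, here is a minimal sketch on three of the terms. Naming both dimensions (the answer only sets the row names) and the variable name `terms` are my own choices for illustration.

```r
# Three of the terms from the answer's data
terms <- c("bank", "banks", "banking")

d <- adist(terms)                  # pairwise edit distances (utils, attached by default)
dimnames(d) <- list(terms, terms)  # name rows and columns so entries can be looked up by term

d["bank", "banks"]    # one insertion ("s")
d["bank", "banking"]  # three insertions ("i", "n", "g")
```

Small values mean the two strings are spelled almost alike, which is all this approach measures: it is a spelling distance, not a semantic one.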

We can then pass it to hclust with method = "ward.D":
cl <- hclust(as.dist(d), method = "ward.D")
plot(cl)

This gives:

[Figure: dendrogram of the clustered terms]

We can see 4 distinct clusters (which can be highlighted with rect.hclust(cl, 4)):

[Figure: the same dendrogram with the 4 clusters outlined by rect.hclust]
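Since the figures are not available here, the cluster membership can also be inspected as text. This is a minimal sketch that rebuilds the tree from the 12 terms appearing as row names of the distance matrix; the choice of k = 4 comes from reading the dendrogram and is not computed automatically.

```r
# The 12 terms that label the rows of the distance matrix in the answer
v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv",
       "banker", "toyota_corolla", "toyota", "nissan", "bankers", "ford")

d <- adist(v)
rownames(d) <- v                       # as.dist() keeps these as labels
cl <- hclust(as.dist(d), method = "ward.D")

groups <- cutree(cl, k = 4)            # named integer vector: term -> cluster id
split(names(groups), groups)           # list the terms that fall in each cluster
```

For a different vocabulary the number of clusters has to be chosen again, for example by looking at the heights in plot(cl).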

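Now we can turn this result into a data.frame and tag each cluster with its shortest term:
library(dplyr)
data.frame(group = cutree(cl, 4)) %>%
  tibble::rownames_to_column("term") %>%
  group_by(group) %>%
  # which.min() picks a single shortest term, so ties in length cannot break mutate()
  mutate(tag = term[which.min(nchar(term))])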

This gives:
#Source: local data frame [12 x 3]
#Groups: group [4]
#
#             term group    tag
#            <chr> <int>  <chr>
#1            bank     1   bank
#2           banks     1   bank
#3         banking     1   bank
#4        ford_suv     2   ford
#5      toyota_suv     3 toyota
#6      nissan_suv     4 nissan
#7          banker     1   bank
#8  toyota_corolla     3 toyota
#9          toyota     3 toyota
#10         nissan     4 nissan
#11        bankers     1   bank
#12           ford     2   ford

If we only want to extract the unique tag for each cluster, we can add ... %>% distinct(tag) %>% .$tag to the pipe:
#[1] "bank"   "ford"   "toyota" "nissan"
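The same steps can be wrapped up and applied to the six-term vector from the question. This is a sketch in base R only; the helper name `dedupe_terms` is hypothetical, and k = 4 is an assumption I picked because the question expects four terms back.

```r
# Hypothetical helper (not part of the original answer): cluster terms by
# edit distance and keep the shortest term of each cluster.
dedupe_terms <- function(terms, k) {
  d <- adist(terms)
  rownames(d) <- terms
  cl <- hclust(as.dist(d), method = "ward.D")
  groups <- cutree(cl, k = k)
  # which.min() returns a single index, so ties resolve to the first term
  tags <- vapply(split(terms, groups),
                 function(x) x[which.min(nchar(x))],
                 character(1))
  unname(tags)
}

v <- c("bank", "banks", "banking", "ford_suv", "toyota_suv", "nissan_suv")
result <- dedupe_terms(v, k = 4)
result
```

With this input, bank and banks merge first, banking joins them next, and the three *_suv terms stay separate at k = 4, so the result matches the vector the question asks for. Because the clustering is driven by spelling alone, this only works when related terms are also close in spelling.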

References
?adist

The (generalized) Levenshtein (or edit) distance between two strings s and t is the minimal possibly weighted number of insertions, deletions and substitutions needed to transform s into t (so that the transformation exactly matches t).
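To make that definition concrete, here is a minimal sketch of the unweighted case (every insertion, deletion and substitution costs 1) written as a plain dynamic program. The function name `lev` is my own, it assumes non-empty single strings, and it is only meant to illustrate what adist computes with its default costs, not how adist is implemented.

```r
# Plain dynamic-programming Levenshtein distance with unit costs.
# Assumes s and t are single, non-empty strings.
lev <- function(s, t) {
  a <- strsplit(s, "")[[1]]
  b <- strsplit(t, "")[[1]]
  # m[i + 1, j + 1] = distance between the first i chars of s and first j chars of t
  m <- matrix(0L, nrow = length(a) + 1, ncol = length(b) + 1)
  m[, 1] <- seq_len(nrow(m)) - 1L   # delete every char of the prefix of s
  m[1, ] <- seq_len(ncol(m)) - 1L   # insert every char of the prefix of t
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      m[i + 1, j + 1] <- min(m[i, j + 1] + 1L,  # deletion
                             m[i + 1, j] + 1L,  # insertion
                             m[i, j] + cost)    # substitution or match
    }
  }
  m[nrow(m), ncol(m)]
}

lev("bank", "banking")               # three insertions
lev("bank", "ford")                  # four substitutions
lev("bank", "banks") == adist("bank", "banks")[1, 1]
```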


?hclust

This function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster.



Note: I used the data provided by @Abdou in the comments, as it represents a more complete use case.

Original question on Stack Overflow: https://stackoverflow.com/questions/38956241/
