
r - Filter out similar strings using cosine similarity in a vector of strings

Reposted. Author: 行者123. Last updated: 2023-12-03 19:00:42

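I have a vector of strings. Some of the strings in the vector (possibly more than two) are similar to each other in terms of the words they contain. I want to filter out strings whose cosine similarity with any other string in the vector exceeds 30%. Of two strings being compared, I want to keep the one that contains more words. In other words, I only want strings whose similarity to every other string in the original vector is below 30%. My goal is to filter out similar strings and keep only the roughly distinct ones.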

For example, the vector is:

x <- c("Dan is a good man and very smart", "A good man is rare", "Alex can be trusted with anything", "Dan likes to share his food", "Rare are man who can be trusted", "Please share food")

The result should be (assuming a similarity below 30%):
c("Dan is a good man and very smart", "Dan likes to share his food", "Rare are man who can be trusted")

The result above has not been verified.

The cosine code I am using:
library(tm)   # VCorpus, VectorSource, DocumentTermMatrix, weightTf
library(lsa)  # cosine

CSString_vector <- c("String One", "String Two")
corp <- VCorpus(VectorSource(CSString_vector))
controlForMatrix <- list(removePunctuation = TRUE, wordLengths = c(1, Inf),
                         weighting = weightTf)
dtm <- DocumentTermMatrix(corp, control = controlForMatrix)
matrix_of_vector <- as.matrix(dtm)
res <- lsa::cosine(matrix_of_vector[1, ], matrix_of_vector[2, ])
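
To make clear what the code above computes, here is a minimal base R sketch of the same bag-of-words cosine similarity for two strings. The helper name `cosine_sim_words` is hypothetical (it is not part of tm or lsa), and the tokenization (lower-casing, stripping punctuation, splitting on whitespace) is an assumption meant to mirror the control list above.

```r
# Cosine similarity between two strings from bag-of-words term counts.
# `cosine_sim_words` is a hypothetical helper, not part of tm or lsa.
cosine_sim_words <- function(a, b) {
  # assumed tokenization: lower-case, strip punctuation, split on whitespace
  tokenize <- function(s) {
    s <- tolower(gsub("[[:punct:]]", "", s))
    strsplit(trimws(s), "\\s+")[[1]]
  }
  wa <- tokenize(a)
  wb <- tokenize(b)
  vocab <- union(wa, wb)
  # term-frequency vectors over the shared vocabulary
  va <- as.numeric(table(factor(wa, levels = vocab)))
  vb <- as.numeric(table(factor(wb, levels = vocab)))
  # dot product divided by the product of the vector lengths
  sum(va * vb) / (sqrt(sum(va^2)) * sqrt(sum(vb^2)))
}

cosine_sim_words("Dan is a good man and very smart", "A good man is rare")
# 4 shared words / (sqrt(8) * sqrt(5)), roughly 0.632
```

The two strings share four words (a, good, man, is), so they sit well above a 30% threshold.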

I am working in RStudio.

Best answer

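So, to rephrase what you want: you want to compute the pairwise similarity of all pairs of strings. You then want to use that similarity matrix to identify groups of strings that are dissimilar enough to form distinct groups. For each of these groups, you want to drop all but the longest string and return that one. Did I get that right?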

After some experimentation, here is the solution I came up with, step by step:

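  • compute the similarity matrix
  • binarize it using a threshold
  • identify the distinct groups (cliques) using a graph algorithm from the igraph package
  • find all strings in each clique and keep the longest one

  • Note: I had to adjust the threshold to 0.4 to make your example work.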

    Similarity matrix

    This is largely based on the code you provided, but I wrapped it in a function and used the tidyverse to make the code, at least to my taste, more readable.
    library(tm)
    library(lsa)
    library(tidyverse)

    get_cos_sim <- function(corpus) {
      # pre-process corpus
      doc <- corpus %>%
        VectorSource() %>%
        tm::VCorpus()
      # get term frequency matrix
      tfm <- doc %>%
        DocumentTermMatrix(
          control = list(
            removePunctuation = TRUE,
            wordLengths = c(1, Inf),
            weighting = weightTf)) %>%
        as.matrix()
      # get row-wise similarity: row i holds the cosine of string i
      # with every string in the corpus
      sim <- NULL
      for (i in seq_len(nrow(tfm))) {
        sim_i <- apply(
          X = tfm,
          MARGIN = 1,
          FUN = lsa::cosine,
          tfm[i, ])
        sim <- rbind(sim, sim_i)
      }
      # set identity diagonal to zero
      diag(sim) <- 0
      # label and return
      rownames(sim) <- corpus
      return(sim)
    }
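
The row-by-row loop can also be written with matrix algebra. The following is a minimal sketch rather than a drop-in replacement: `cosine_matrix` is a hypothetical name, and it assumes `tfm` is a numeric term-frequency matrix with one row per document and no all-zero rows (an empty document would cause a division by zero).

```r
# Pairwise cosine similarity for all rows of a term-frequency matrix at once.
# `cosine_matrix` is a hypothetical helper name used for illustration.
cosine_matrix <- function(tfm) {
  dots  <- tcrossprod(tfm)        # dot products between every pair of rows
  norms <- sqrt(diag(dots))       # Euclidean length of each row
  sim   <- dots / outer(norms, norms)
  diag(sim) <- 0                  # ignore self-similarity
  sim
}

# toy term-frequency matrix: 3 documents, 4 terms
tfm <- rbind(
  d1 = c(1, 1, 0, 0),
  d2 = c(1, 1, 1, 0),
  d3 = c(0, 0, 0, 2)
)
cosine_matrix(tfm)
```

Here `d1` and `d2` share two terms and get a similarity of 2 / sqrt(6), while `d3` shares no terms with either and gets 0.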

    Now we apply this function to your example data:
    # example corpus
    strings <- c(
    "Dan is a good man and very smart",
    "A good man is rare",
    "Alex can be trusted with anything",
    "Dan likes to share his food",
    "Rare are man who can be trusted",
    "Please share food")

    # get pairwise similarities
    sim <- get_cos_sim(strings)
    # binarize (using a different threshold to make your example work)
    sim <- sim > .4
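
When tuning the threshold it helps to see which pairs end up flagged as similar. The sketch below lists each flagged pair once; `sim_toy` is a made-up stand-in for the binarized matrix, not the result for the example strings.

```r
# toy binarized similarity matrix (symmetric, FALSE on the diagonal)
sim_toy <- matrix(FALSE, nrow = 4, ncol = 4,
                  dimnames = list(paste0("s", 1:4), paste0("s", 1:4)))
sim_toy["s1", "s2"] <- sim_toy["s2", "s1"] <- TRUE
sim_toy["s3", "s4"] <- sim_toy["s4", "s3"] <- TRUE

# keep only the upper triangle so that each pair is reported once
pairs <- which(sim_toy & upper.tri(sim_toy), arr.ind = TRUE)
pairs
```

Each row of `pairs` holds the row and column index of one similar pair.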

    Identifying distinct groups

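    This turned out to be an interesting problem! I found this paper, Chalermsook & Chuzhoy: Maximum Independent Set of Rectangles, which led me to this implementation in the igraph package. Basically, we treat similar strings as connected vertices in a graph and then look for the distinct groups in the graph of the whole similarity matrix.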
    library(igraph)

    # create graph from adjacency matrix
    cliques <- sim %>%
      tibble::as_tibble() %>%              # columns are named "1", "2", ...
      mutate(from = row_number()) %>%
      gather(key = 'to', value = 'edge', -from) %>%
      filter(edge == TRUE) %>%             # keep only pairs above the threshold
      graph_from_data_frame(directed = FALSE) %>%
      max_cliques()

    Finding the longest string

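    Now we can use the list of cliques to retrieve the strings for each of the vertices and pick the longest string per clique. Caveat: strings that have no similar string in the corpus are missing from the graph. I am adding them back manually. There may be a function in the igraph package that handles this better; I would be interested if anyone finds something.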
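    # helper assumed here (its definition is not shown in this copy): return
    # the longest string, by character count, among the given indices
    find_longest <- function(indices, strings) {
      candidates <- strings[indices]
      candidates[which.max(nchar(candidates))]
    }
    # get the string indices per vertex clique first
    string_cliques_index <- cliques %>%
      unlist() %>%
      names() %>%
      as.numeric()
    # find the indices that are distinct but not in a clique
    # (i.e. unconnected vertices)
    string_uniques_index <- colnames(sim)[!colnames(sim) %in% string_cliques_index] %>%
      as.numeric()
    # get a list with all indices
    all_distinct <- cliques %>%
      lapply(names) %>%
      lapply(as.numeric) %>%
      c(string_uniques_index)
    # get a list of distinct strings
    lapply(all_distinct, find_longest, strings)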
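
Regarding the caveat about unconnected vertices: one possible way around it is to build the graph directly from the adjacency matrix, which keeps isolated vertices. The sketch below uses toy data and swaps in connected components (`components()`) for maximal cliques. That is a different grouping: if A is similar to B and B to C, all three land in one group even when A and C are not similar. For groups that are simple pairs the two approaches agree.

```r
library(igraph)

# toy data: strings 1 and 2 are "similar", string 3 has no similar partner
toy_strings <- c("a good man", "a good man is rare", "please share food")
adj <- matrix(0, 3, 3)
adj[1, 2] <- adj[2, 1] <- 1

# building from an adjacency matrix keeps unconnected vertices in the graph
g <- graph_from_adjacency_matrix(adj, mode = "undirected")

# group vertex indices by connected component (not by maximal clique)
groups <- split(seq_along(toy_strings), components(g)$membership)

# keep the longest string (by character count) in each group
kept <- vapply(groups, function(idx) {
  toy_strings[idx][which.max(nchar(toy_strings[idx]))]
}, character(1))
unname(kept)
```

With the real data, `adj` would be the binarized similarity matrix converted to numeric, for example `sim * 1`.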

    Test case:

    Let's test with a longer vector of distinct strings:
    strings <- c(
    "Dan is a good man and very smart",
    "A good man is rare",
    "Alex can be trusted with anything",
    "Dan likes to share his food",
    "Rare are man who can be trusted",
    "Please share food",
    "NASA is a government organisation",
    "The FBI organisation is part of the government of USA",
    "Hurricanes are a tragedy",
    "Mangoes are very tasty to eat ",
    "I like to eat tasty food",
    "The thief was caught by the FBI")

    I get this binarized similarity matrix:
    Dan is a good man and very smart                      FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    A good man is rare                                     TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    Alex can be trusted with anything                     FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    Dan likes to share his food                           FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
    Rare are man who can be trusted                       FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    Please share food                                     FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    NASA is a government organisation                     FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    The FBI organisation is part of the government of USA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
    Hurricanes are a tragedy                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
    Mangoes are very tasty to eat                         FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
    I like to eat tasty food                              FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
    The thief was caught by the FBI                       FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

    Based on these similarities, the expected result would be:
    # included
    Dan is a good man and very smart
    Alex can be trusted with anything
    Dan likes to share his food
    NASA is a government organisation
    The FBI organisation is part of the government of USA
    Hurricanes are a tragedy
    Mangoes are very tasty to eat

    # omitted
    A good man is rare
    Rare are man who can be trusted
    Please share food
    I like to eat tasty food
    The thief was caught by the FBI

    The actual output has the correct elements, but not in the original order.
    You can reorder it using the original string vector.
    [[1]]
    [1] "The FBI organisation is part of the government of USA"

    [[2]]
    [1] "Dan is a good man and very smart"

    [[3]]
    [1] "Alex can be trusted with anything"

    [[4]]
    [1] "Dan likes to share his food"

    [[5]]
    [1] "Mangoes are very tasty to eat "

    [[6]]
    [1] "NASA is a government organisation"

    [[7]]
    [1] "Hurricanes are a tragedy"
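
Restoring the original order can be done by indexing the original vector with the kept strings. A minimal sketch with made-up data, where `result` stands in for the list returned by the `lapply()` call:

```r
strings <- c("first string", "second string", "third string", "fourth string")
# stand-in for the list of kept strings, in arbitrary order
result <- list("third string", "first string")

kept <- unlist(result)
# subset the original vector so that its order is preserved
ordered <- strings[strings %in% kept]
ordered
```

Because the subsetting runs over `strings`, the kept elements come back in the order in which they appear in the input.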

    That's it!
    I hope this is what you were looking for, and that it may be useful to others.

    Regarding "r - Filter out similar strings using cosine similarity in a vector of strings", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49916981/
