R: How to create clusters based on row strings

Reposted. Author: 行者123. Updated: 2023-12-04 13:22:52

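I am trying to create clusters from my data based on the string value of each row, using R. By "cluster" I mean a broad theme (a family) that characterizes each keyword. I imagine something generated automatically from the keywords, perhaps through lemmatization or n-grams.

For example, the keywords "cloud services" and "the cloud service" should both fall into the "service" cluster.

Here is my input vector: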

keywords_df <- c("cloud storage", "cloud computing", "google cloud storage", "the cloud service", 
"free cloud storage", "what is cloud computing", "best cloud storage","cloud computing definition",
"amazon cloud services", "cloud service providers", "cloud services", "google cloud computing", "cloud computing services", "benefits of cloud computing")

Here is the expected output data frame:

| Keyword                     | Thematic  |
|-----------------------------|:---------:|
| cloud storage               | storage   |
| cloud computing             | computing |
| google cloud storage        | storage   |
| the cloud service           | service   |
| free cloud storage          | storage   |
| what is cloud computing     | computing |
| best cloud storage          | storage   |
| cloud computing definition  | computing |
| amazon cloud services       | service   |
| cloud service providers     | service   |
| cloud services              | service   |
| google cloud computing      | computing |
| cloud computing services    | service   |
| benefits of cloud computing | computing |

The goal is to clean the data in the "Keyword" column and automatically extract some kind of lemma or n-gram.

Here is what I have done so far:

  1. Create a "Thematic" column from the Keyword column:

    library(dplyr)
    # assumes keywords_df is a data frame with a Keyword column
    keywords_df <- mutate(keywords_df, Thematic = Keyword)
    keywords_df$Thematic <- as.character(keywords_df$Thematic)

  2. Remove stop words:

    library(tm)
    stopwords_list <- c("cloud")  # remove the main word
    stopwords <- stopwords(kind = "en")
    stopwords <- append(stopwords, stopwords_list)
    x <- keywords_df$Thematic
    x <- removeWords(x, stopwords)
    keywords_df$Thematic <- x
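
The two steps above can be sketched in base R alone, which may help when checking what the stop-word removal actually leaves behind. This is only an illustration: the three sample keywords are taken from the input vector, while `drop_words` is a hypothetical, hand-picked stop list standing in for `tm::stopwords("en")` plus "cloud".

```r
# Small sample of the question's keywords, held in a data frame
keywords_df <- data.frame(
  Keyword = c("cloud storage", "the cloud service", "what is cloud computing"),
  stringsAsFactors = FALSE
)

# Hypothetical stop list: the main word plus a few English stop words
drop_words <- c("cloud", "the", "what", "is")

# Build one regex that matches any stop word as a whole word
pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")

# Remove the stop words, then collapse and trim the leftover whitespace
cleaned <- gsub(pattern, "", keywords_df$Keyword, perl = TRUE)
keywords_df$Thematic <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))

print(keywords_df)
```

With a longer keyword such as "cloud computing definition", more than one word survives the removal, so stop-word removal on its own does not yield a single theme per row.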

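Best answer

You can use grepl() to check for the presence of certain words, such as storage, computing, and service. That way you can test whether each entry of df contains a given word: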

# df is the character vector of keywords from the question
fams   <- c("storage", "computing", "service")
family <- rep("empty_fam", length(df))

# when a keyword matches several families, the last match in fams wins
for (fam in fams) {
  family[grepl(fam, df)] <- fam
}
cbind(Keywords = df, family)
#       Keywords                      family
#  [1,] "cloud storage"               "storage"
#  [2,] "cloud computing"             "computing"
# ...
# [13,] "cloud computing services"    "service"
# [14,] "benefits of cloud computing" "computing"

That said, there are certainly better ways to do this.

Edit: a better way is to use stringr:

library(stringr)
family <- str_extract(df, pattern="storage|computing|service")
cbind(df, family)
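
For comparison, here is a minimal base R sketch of the same extraction that builds the pattern from the `fams` vector instead of typing it by hand. The shortened keyword sample and the extra keyword "cloud pricing" (added to show a non-match) are assumptions for illustration, not part of the original data.

```r
# Shortened sample of the keywords; "cloud pricing" is a made-up non-matching entry
df <- c("cloud storage", "the cloud service", "cloud computing services",
        "benefits of cloud computing", "cloud pricing")

fams    <- c("storage", "computing", "service")
pattern <- paste(fams, collapse = "|")  # "storage|computing|service"

# regexpr() locates the first match in each string (-1 when there is none)
m <- regexpr(pattern, df)

# Start from NA and fill in the matched text where a match exists
family <- rep(NA_character_, length(df))
family[m != -1] <- regmatches(df, m)

result <- data.frame(Keyword = df, family = family, stringsAsFactors = FALSE)
print(result)
```

One difference from the grepl() loop is worth keeping in mind: a regex alternation returns the leftmost match in the string, so "cloud computing services" is labelled "computing" here, whereas the loop labels it "service" because later families overwrite earlier ones.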

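Edit 2: I saw your latest edit, which indicates that you are looking for family descriptions that are not specified in advance. In that case, the first approach that comes to mind is Latent Dirichlet Allocation (LDA, not to be confused with linear discriminant analysis).

LDA analyzes a corpus of documents, identifies latent topics as distributions over words (see terms(lda.output) below), and determines which document belongs to which topic (see topics(lda.output) below):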

library(topicmodels)
library(tm)

# Preliminary textmining
corpus <- Corpus(VectorSource(df))
corpus <- tm_map(corpus, removeWords, "cloud")
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stemDocument)

# Term Frequency matrix
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))

lda.output <- LDA(TF, k=3)
terms(lda.output)
# Topic 1 Topic 2 Topic 3
# "servic" "comput" "storag"

cbind(df, terms(lda.output)[topics(lda.output)])
# df
#Topic 3 "cloud storage" "storag"
#Topic 2 "cloud computing" "comput"
#Topic 3 "google cloud storage" "storag"
#Topic 1 "cloud services" "servic"
#Topic 3 "free cloud storage" "storag"
#Topic 2 "what is cloud computing" "comput"
#Topic 3 "best cloud storage" "storag"
#Topic 1 "cloud computing definition" "servic"
#Topic 1 "amazon cloud services" "servic"
#Topic 3 "cloud service providers" "storag"
#Topic 2 "google cloud services" "comput"
#Topic 2 "google cloud computing" "comput"
#Topic 1 "cloud computing services" "servic"
#Topic 2 "benefits of cloud computing" "comput"

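Final note: if you want to get "computing" rather than "comput" and so on, you should change the stemming step in the text mining. You can also leave it out, but then "service" and "services" will not be recognized as the same word. You can, however, manually replace "service" with "services" or vice versa.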
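
Both manual fixes mentioned in the note can be sketched in base R. The lookup table `stem_to_label` is a hypothetical, hand-written mapping from the stems shown in the LDA output to readable labels; it is one possible workaround, not something topicmodels or tm provides.

```r
# Stems as they appear in the LDA output above
stems <- c("storag", "comput", "servic", "comput")

# Hypothetical hand-written lookup from stem to readable label
stem_to_label <- c(storag = "storage", comput = "computing", servic = "service")
labels <- unname(stem_to_label[stems])
print(labels)

# Alternative when stemming is skipped: fold the plural into the singular by hand
keywords   <- c("amazon cloud services", "cloud service providers")
normalised <- gsub("\\bservices\\b", "service", keywords, perl = TRUE)
print(normalised)
```

The lookup has to be extended by hand for every new stem, so it only suits a small, stable vocabulary like the one in this example.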

Regarding "R: How to create clusters based on row strings", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/47266183/
