r - Computing a word co-occurrence matrix in R

Reposted. Author: 行者123. Last updated: 2023-12-02 09:22:59

I want to compute a word co-occurrence matrix in R. I have the following data frame of sentences:

# One sentence per row; keep the text as character rather than factor.
dat <- data.frame(text = c("The boy is tall.",
                           "The girl is short.",
                           "The tall boy and the short girl are friends."),
                  stringsAsFactors = FALSE)

This gives me

The boy is tall.
The girl is short.
The tall boy and the short girl are friends.

What I want to do is first list all the unique words across the three sentences, i.e.

The
boy
is
tall
girl
short
and
are
friends

Then I want to create the word co-occurrence matrix, which counts in how many sentences each pair of words appears together, like this:

        The  boy  is  tall  girl  short  and  are  friends
The       0    2   2     2     2      2    1    1        1
boy       2    0   1     2     1      1    1    1        1
is        2    1   0     1     1      1    0    0        0
tall      2    2   1     0     1      1    1    1        1
etc.

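This should be done for all words, and a word cannot co-occur with itself. Note that in sentence 3 the word "the" appears twice; the solution should count the co-occurrences of that "the" only once.

Does anyone know how I can do this? I am working with a data frame of about 3000 sentences.

Best answer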

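library(tm)

dat <- as.data.frame("The boy is tall.", stringsAsFactors = FALSE)
dat[2, 1] <- "The girl is short."
dat[3, 1] <- "The tall boy and the short girl are friends."

# VectorSource() reads the text column directly; recent tm versions require
# DataframeSource() input to have doc_id and text columns.
ds <- VCorpus(VectorSource(dat[, 1]))
dtm <- DocumentTermMatrix(ds, control = list(removePunctuation = TRUE,
                                             wordLengths = c(1, Inf)))

X <- as.matrix(dtm)   # documents x terms matrix of counts
X[X > 0] <- 1         # count each word at most once per sentence
out <- crossprod(X)   # same as t(X) %*% X
diag(out) <- 0        # remove own-word co-occurrences
out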
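Output as posted in the answer (it lists friend and omits is, and, are, so it appears to come from a corpus that had already been stemmed and had stop words removed):

        Terms
Terms    boy friend girl short tall the
  boy      0      1    1     1    2   2
  friend   1      0    1     1    1   1
  girl     1      1    0     2    1   2
  short    1      1    2     0    1   2
  tall     2      1    1     1    0   2
  the      2      1    2     2    2   0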
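
For comparison, here is a minimal base-R sketch of the same idea without tm. It is not part of the original answer. It assumes that lower-casing is acceptable (so "The" and "the" are merged, unlike the word list in the question) and that splitting on whitespace after removing punctuation is a good enough tokenizer. The names sentences, tokens, vocab, X and cooc are illustrative only.

```r
# Base-R sketch: sentence-level (binary) word co-occurrence counts.
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Lower-case, drop punctuation, split on whitespace,
# and keep each word only once per sentence.
cleaned <- gsub("[[:punct:]]", "", tolower(sentences))
tokens  <- lapply(strsplit(cleaned, "\\s+"), unique)

# Vocabulary in order of first appearance.
vocab <- unique(unlist(tokens))

# Binary sentence-by-word incidence matrix: 1 if the word occurs in the sentence.
X <- t(vapply(tokens,
              function(w) as.numeric(vocab %in% w),
              numeric(length(vocab))))
colnames(X) <- vocab

cooc <- crossprod(X)   # t(X) %*% X: number of sentences containing both words
diag(cooc) <- 0        # a word does not co-occur with itself

cooc["the", "boy"]     # 2
cooc["is", "tall"]     # 1
```

Because each sentence contributes at most 1 per word, the second "the" in sentence 3 does not inflate the counts.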

You may also want to remove stop words such as "the", for example:

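ds <- tm_map(ds, content_transformer(tolower))        # so "The" matches the stop word "the"
ds <- tm_map(ds, removePunctuation)
ds <- tm_map(ds, removeWords, stopwords("english"))   # this list already contains "the"
ds <- tm_map(ds, removeWords, stopwords("spanish"))   # only relevant if the text contains Spanish
ds <- tm_map(ds, stemDocument)                        # stem after stop-word removal so stems cannot hide stop words
ds <- tm_map(ds, stripWhitespace)                     # collapse the gaps left by removed words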
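
The question mentions about 3000 sentences. With a larger vocabulary the dense matrix can get big, so one option is to keep everything sparse. The sketch below is an assumption-laden illustration rather than part of the original answer: it uses the Matrix package (normally installed with R as a recommended package), the same simple whitespace tokenizer as a stand-in for tm, and illustrative names (sentences, tokens, vocab, X, cooc).

```r
library(Matrix)  # recommended package, normally installed alongside R

sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Simple tokenizer (an assumption): lower-case, strip punctuation,
# split on whitespace, keep each word once per sentence.
cleaned <- gsub("[[:punct:]]", "", tolower(sentences))
tokens  <- lapply(strsplit(cleaned, "\\s+"), unique)
vocab   <- sort(unique(unlist(tokens)))

# One (sentence index, word index) pair per distinct word in each sentence.
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens), vocab)

# Sparse binary sentence-by-word incidence matrix.
X <- sparseMatrix(i = i, j = j, x = rep(1, length(i)),
                  dims = c(length(tokens), length(vocab)),
                  dimnames = list(NULL, vocab))

cooc <- crossprod(X)   # sparse word-by-word co-occurrence counts
diag(cooc) <- 0        # a word does not co-occur with itself

cooc["the", "boy"]     # 2
```

Stop words could be dropped before building X by filtering each element of tokens against a stop-word vector of your choice.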

Regarding r - Computing a word co-occurrence matrix in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/40464014/
