
r - How do I convert an R data frame with a single column into a tm corpus so that each row is treated as a document?

Reposted. Author: 行者123. Updated: 2023-12-01 08:18:12

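I want to use the findAssocs function from the tm package, but it only works when the corpus contains more than one document. What I have instead is a single-column data frame in which each row holds the text of one tweet. Is it possible to convert it into a corpus that treats each row as a separate document? Right now I get: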

VCorpus (documents: 1, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 1)

My data has 10 rows, and I would like it converted into:

VCorpus (documents: 10, metadata (corpus/indexed): 0/0)
TermDocumentMatrix (terms: 71, documents: 10)

Best answer

I suggest reading the tm vignette before going any further. My answer to your specific question is below.

Create sample data:

# Split a sentence on spaces so that each word becomes one row of the data frame
txt <- strsplit("I wanted to use the findAssocs of the tm package. but it works only when there are more than one documents in the corpus. I have a data frame table which has one column and each row has a tweet text. Is it possible to convert the into a corpus which takes each row as a new document?", split=" ")[[1]]
data <- data.frame(text=txt, stringsAsFactors=FALSE)
# Inspect the first five rows
data[1:5, ]

Load your data into a Source, load the Source into a Corpus, and then build a TDM (term-document matrix) from the Corpus:

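library(tm)
# Each row of the data frame becomes one document in the corpus.
# Note: tm 0.7-2 and later require the DataframeSource input to have "doc_id" and "text" columns.
tdm <- TermDocumentMatrix(Corpus(DataframeSource(data)))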

show(tdm)
#A term-document matrix (35 terms, 58 documents)
#
#Non-/sparse entries: 43/1987
#Sparsity : 98%
#Maximal term length: 10
#Weighting : term frequency (tf)
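
The 58 documents reported above correspond to the 58 rows of the sample data frame. As an alternative sketch, a single text column can also be passed as a character vector through VectorSource, which yields one document per element and does not depend on how the data frame's columns are named. The `tweets` data frame and its contents are hypothetical, made up for illustration; they are not the asker's data.

```r
library(tm)

# Hypothetical stand-in for the asker's data: one tweet per row.
tweets <- data.frame(
  text = c("I like the tm package",
           "findAssocs needs more than one document",
           "each row should become a document"),
  stringsAsFactors = FALSE
)

# VectorSource treats each element of a character vector as one document.
corpus <- VCorpus(VectorSource(tweets$text))
tdm_vec <- TermDocumentMatrix(corpus)

length(corpus)  # number of documents in the corpus, one per row
ncol(tdm_vec)   # number of document columns in the term-document matrix
```

Both counts should equal `nrow(tweets)`, which is a quick way to check that each row really became its own document.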

str(tdm)
#List of 6
# $ i : int [1:43] 32 31 28 12 28 21 3 35 20 33 ...
# $ j : int [1:43] 2 4 5 6 8 10 11 13 14 15 ...
# $ v : num [1:43] 1 1 1 1 1 1 1 1 1 1 ...
# $ nrow : int 35
# $ ncol : int 58
# $ dimnames:List of 2
# ..$ Terms: chr [1:35] "and" "are" "but" "column" ...
# ..$ Docs : chr [1:58] "1" "2" "3" "4" ...
# - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
# - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
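
Once the matrix has more than one document column, findAssocs has something to correlate over. The following is a minimal sketch of that last step, assuming a small set of hypothetical tweets (the `tweets` vector, the term "corpus" and the 0.5 correlation limit are chosen only for illustration):

```r
library(tm)

# Hypothetical sample tweets, one per element.
tweets <- c("the corpus has many documents",
            "each document is one tweet",
            "the corpus is built from tweets",
            "findAssocs needs several documents")

# One document per tweet, then a term-document matrix.
tdm_demo <- TermDocumentMatrix(VCorpus(VectorSource(tweets)))

# Terms whose correlation with "corpus" across documents is at least 0.5.
assocs <- findAssocs(tdm_demo, "corpus", 0.5)
assocs
```

The result is a list with one entry per queried term. With real tweets you would probably lower the correlation limit and clean the text first (for example by removing stop words), since very short documents give coarse correlations.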

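Regarding "r - How do I convert an R data frame with a single column into a tm corpus so that each row is treated as a document?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/26711423/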
