
r - tf-idf for strings in a CSV file

Reposted. Author: 行者123. Updated: 2023-11-30 09:23:23

My test.csv file is (no header row):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus to be split on commas, so that my final DocumentTermMatrix becomes:

      terms
docs  very good  very bad  you are great  good restaurent  nice place to visit
doc1  tf-idf     tf-idf    tf-idf         0                0
doc2  0          tf-idf    0              tf-idf           tf-idf

If I don't load the documents from a CSV file, I can generate the DTM above correctly, like this:

library(tm)
docs <- c(D1 = "very good, very bad, you are great",
          D2 = "very bad, good restaurent, nice place to visit")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
    id = ID(x)
  )
})
inspect(dd)

# A corpus with 2 text documents
#
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
# create_date creator
# Available variables in the data frame are:
# MetaID

# $D1
# very~good
# very~bad
# you~are~great
#
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

This produces:

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
# D1         0.0000000           0.0000000        0 0.3333333     0.3333333
# D2         0.3333333           0.3333333        0 0.0000000     0.0000000
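
The 0.3333333 and 0 entries can be checked by hand. The following is a minimal base-R sketch, not the tm implementation. It assumes the weighting is term frequency normalised by document length, multiplied by log2(N / df), which is consistent with the values shown. The names phrases, counts, tf, idf and tfidf are illustrative only.

```r
# Base-R sketch (no tm needed): rebuild the weighted matrix by hand.
docs <- c(D1 = "very good, very bad, you are great",
          D2 = "very bad, good restaurent, nice place to visit")

# Split each document on commas and join the words of each phrase with "~".
phrases <- lapply(strsplit(docs, ",\\s*"), function(p) gsub("\\s+", "~", p))

# Raw counts: one row per document, one column per distinct phrase.
terms <- sort(unique(unlist(phrases)))
counts <- t(sapply(phrases, function(p) table(factor(p, levels = terms))))

# Assumed weighting: tf normalised by document length, idf = log2(N / df).
tf <- counts / rowSums(counts)
idf <- log2(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 7)
```

Under that assumption, very~bad scores 0 because it occurs in both documents (log2(2 / 2) = 0), while every other phrase occurs in one document out of two and makes up one third of it (1/3 * log2(2 / 1) = 0.3333333).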

If I load the documents from the CSV file, only the first term of each document is included, as shown below:

> file_loc <- "testdata.csv"
> require(tm)
Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
> dd <- tm_map(dd, function(x) {
    PlainTextDocument(
      gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
      id = ID(x)
    )
  })
> inspect(dd)

Only the first term is joined, as shown below:

# $D1
# very~good

#
# $D2
# very~bad

How can I include all the terms and create a DocumentTermMatrix like the one above?

Best answer

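You are reading the data in incorrectly. read.csv splits each line at the commas into separate columns, whereas the transformation expects each document to be one whole line. I used scan to read the file instead. This works: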

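# Read each line of the file as one whole string (one document per line)
docs <- scan("testdata.csv", "character", sep = "\n")

# Build the corpus from the lines read above
dd <- Corpus(VectorSource(docs))
# Split each document on commas and join the words of each phrase with "~"
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+", "~", strsplit(x, ",\\s*")[[1]]),
    id = ID(x)
  )
})
inspect(dd)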

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
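
The difference between the two ways of reading the file can be seen without tm. This is a base-R sketch under stated assumptions: it writes the two sample lines to a temporary file as a stand-in for testdata.csv, uses readLines as a line-by-line reader comparable to scan with sep = "\n", and treats one row of the data frame as a rough stand-in for how a row reaches the transformation. The names path, row1, first_only and all_phrases are illustrative only.

```r
# Write the two sample lines to a temporary file (stand-in for testdata.csv).
path <- tempfile(fileext = ".csv")
writeLines(c("very good, very bad, you are great",
             "very bad, good restaurent, nice place to visit"), path)

# read.csv treats each comma as a column separator: 2 rows x 3 columns.
x <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
dim(x)

# A row is now three separate strings, so [[1]] keeps only the first phrase.
row1 <- unlist(x[1, ], use.names = FALSE)
first_only <- gsub("\\s+", "~", strsplit(row1, ",\\s*")[[1]])
first_only   # "very~good"

# readLines keeps each line as a single string, commas included.
docs <- readLines(path)
length(docs)

# Splitting the whole line on commas recovers all three phrases.
all_phrases <- gsub("\\s+", "~", strsplit(docs[1], ",\\s*")[[1]])
all_phrases  # "very~good" "very~bad" "you~are~great"
```

Once the commas have been consumed as column separators there is nothing left for strsplit to split on, which would explain why only very~good and very~bad survived in the question's output.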

Regarding "r - tf-idf for strings in a CSV file", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/24117862/
