
r - Creating N-Grams with tm and RWeka - works with VCorpus but not Corpus

Reposted. Author: 行者123. Updated: 2023-12-02 09:05:34

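After following many guides on creating bigrams with the "tm" and "RWeka" packages, I was frustrated that only 1-grams were being returned in the tdm. Through much trial and error I found that it works correctly with "VCorpus" but not with "Corpus". By the way, I am fairly sure this worked with "Corpus" about a month ago, but it no longer does.

R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to their latest versions.

I would appreciate any insight into why this does not work with Corpus, and whether others have run into the same problem.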

#A Reproducible example
#
#Weka bi-gram test
#

library(tm)
library(RWeka)

someCleanText <- c("Congress shall make no law respecting an establishment of",
"religion, or prohibiting the free exercise thereof or",
"abridging the freedom of speech or of the press or the",
"right of the people peaceably to assemble and to petition",
"the Government for a redress of grievances")

aCorpus <- Corpus(VectorSource(someCleanText)) #With this, only 1-Grams are created
#aCorpus <- VCorpus(VectorSource(someCleanText)) #With this, biGrams are created as desired

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

print(aTDM$dimnames$Terms)

Results with "Corpus"

 [1] "congress"      "establishment" "law"           "make"
 [5] "respecting"    "shall"         "exercise"      "free"
 [9] "prohibiting"   "religion"      "the"           "thereof"
[13] "abridging"     "freedom"       "press"         "speech"
[17] "and"           "assemble"      "peaceably"     "people"
[21] "petition"      "right"         "for"           "government"
[25] "grievances"    "redress"

Results with "VCorpus"

 [1] "a redress"        "abridging the"    "an establishment" "and to"
 [5] "assemble and"     "congress shall"   "establishment of" "exercise thereof"
 [9] "for a"            "free exercise"    "freedom of"       "government for"
[13] "law respecting"   "make no"          "no law"           "of grievances"
[17] "of speech"        "of the"           "or of"            "or prohibiting"
[21] "or the"           "peaceably to"     "people peaceably" "press or"
[25] "prohibiting the"  "redress of"       "religion or"      "respecting an"
[29] "right of"         "shall make"       "speech or"        "the free"
[33] "the freedom"      "the government"   "the people"       "the press"
[37] "thereof or"       "to assemble"      "to petition"
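
A quick way to investigate the difference is to look at the class of the object each constructor returns. The sketch below is only a diagnostic, not a confirmed explanation: it assumes tm 0.7 or later, where (as far as I can tell from the package documentation) Corpus() on a VectorSource may return a SimpleCorpus, and the term-document-matrix method for that class may not honor a custom tokenize function. The variable names defaultCorpus and volatileCorpus are made up for illustration.

```r
library(tm)

# Two short documents, enough to construct both corpus types
sampleText <- c("Congress shall make no law",
                "the free exercise thereof")

# Hypothetical names, used only for this comparison
defaultCorpus  <- Corpus(VectorSource(sampleText))
volatileCorpus <- VCorpus(VectorSource(sampleText))

# Print the class vectors; a difference here would explain why the same
# TermDocumentMatrix() call dispatches to different code paths
print(class(defaultCorpus))
print(class(volatileCorpus))

# TRUE would mean the default constructor did not give a VCorpus
# (assumption: the SimpleCorpus class exists in the installed tm version)
print(inherits(defaultCorpus, "SimpleCorpus"))
```

If the last line prints TRUE, building the corpus explicitly with VCorpus(), as in the commented-out line of the example, is the straightforward workaround.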

Best answer

I was previously on R 3.4.1 and switched to R 3.3.3, and now the VCorpus solution works for me. Both tm and RWeka create the bigrams correctly.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
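
For anyone who wants to rule out RWeka (and its Java dependency) as a factor, here is a minimal sketch of the same test using a pure-R bigram tokenizer built on ngrams() and words() from the NLP package, which is attached along with tm. This is a swapped-in technique, not what the question or the answer used, and it still assumes a VCorpus. Because words() splits on whitespace, punctuation stays attached to tokens (for example "religion, or"), so the term list will not match the RWeka output exactly.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

someCleanText <- c("Congress shall make no law respecting an establishment of",
                   "religion, or prohibiting the free exercise thereof or",
                   "abridging the freedom of speech or of the press or the",
                   "right of the people peaceably to assemble and to petition",
                   "the Government for a redress of grievances")

# VCorpus is used explicitly, following the finding in the question
aCorpus <- VCorpus(VectorSource(someCleanText))

# Pure-R replacement for the RWeka-based tokenizer: split the document
# into words, form consecutive pairs, and join each pair with a space
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

aTDM <- TermDocumentMatrix(aCorpus, control = list(tokenize = BigramTokenizer))

print(Terms(aTDM))
```

If this version yields bigrams while the RWeka version does not, the problem lies in the RWeka or Java setup; if both behave the same, the corpus class is the more likely cause.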

Regarding r - Creating N-Grams with tm and RWeka - works with VCorpus but not Corpus, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/42757183/
