
r - Implementing n-grams for next-word prediction


I am trying to use trigrams for next-word prediction.

I have been able to load a corpus and identify the most common trigrams by their frequency. I used the "ngrams", "RWeka", and "tm" packages in R. I followed this question for guidance:

What algorithm I need to find n-grams?

library(tm)
library(RWeka)

text1 <- readLines("MyText.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))

# min = max = 3 yields trigrams, so name the tokenizer accordingly
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

If a user enters a set of words, how would I generate the next word? For example, if the user types "can of", how would I retrieve the three most likely next words (e.g. beer, soda, paint, etc.)?

Best Answer

Here is one way to get started:

f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  # Count all (k+1)-grams in the history, where k is the number of words in the query;
  # each sentence is repeated according to its observed frequency
  k <- length(scan(text = query, what = "character", quiet = TRUE))
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab),
                           method = "string", n = k + 1))
  # Keep only the n-grams that begin with the (lower-cased) query
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 0, nchar(query)) == query)
  # Return the n most frequent completions, with the query prefix stripped off
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}
f(c("Can of beer" = 3, "can of Soda" = 2, "A can of water" = 1, "Buy me a can of soda, please" = 2), "Can of")
# [1] "soda" "beer"
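The approach is not R-specific: count every (k+1)-gram in the history, keep those starting with the query, and rank the completions by frequency. As a cross-language illustration, here is a minimal Python sketch of the same idea (the function name `next_words` and the regex tokenizer are my own choices, not part of the answer above):

```python
import re
from collections import Counter

def next_words(history_counts, query, n=2):
    """Given observed sentences with frequencies, return the n most likely
    words to follow `query`, using simple (k+1)-gram counts."""
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    prefix = tuple(tokenize(query))
    k = len(prefix)
    # Count every continuation of the prefix, weighted by how often
    # each sentence was observed
    candidates = Counter()
    for sent, count in history_counts.items():
        words = tokenize(sent)
        for i in range(len(words) - k):
            if tuple(words[i:i + k]) == prefix:
                candidates[words[i + k]] += count
    return [w for w, _ in candidates.most_common(n)]

print(next_words({"Can of beer": 3, "can of Soda": 2,
                  "A can of water": 1, "Buy me a can of soda, please": 2},
                 "Can of"))
# → ['soda', 'beer']
```

Like the R version, this does no smoothing: a prefix never seen in the history simply returns an empty list, which is the main limitation of raw n-gram counts.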

Regarding "r - implementing n-grams for next-word prediction", there is a similar question on Stack Overflow: https://stackoverflow.com/questions/31316274/
