
r - Converting a list of tokens to n-grams

Reposted · Author: 行者123 · Updated: 2023-12-04

I have a list of documents that have already been tokenized:

dat <- list(c("texaco", "canada", "lowered", "contract", "price", "pay", 
"crude", "oil", "canadian", "cts", "barrel", "effective", "decrease",
"brings", "companys", "posted", "price", "benchmark", "grade",
"edmonton", "swann", "hills", "light", "sweet", "canadian", "dlrs",
"bbl", "texaco", "canada", "changed", "crude", "oil", "postings",
"feb", "reuter"), c("argentine", "crude", "oil", "production",
"pct", "january", "mln", "barrels", "mln", "barrels", "january",
"yacimientos", "petroliferos", "fiscales", "january", "natural",
"gas", "output", "totalled", "billion", "cubic", "metrers", "pct",
"billion", "cubic", "metres", "produced", "january", "yacimientos",
"petroliferos", "fiscales", "added", "reuter"))

I'm trying to efficiently convert this list of tokens into a list of n-grams. Here is the function I've written so far:
find_ngrams <- function(x, n){
  if (n == 1) return(x)
  out <- as.list(rep(NA, length(x)))
  for (i in 1:length(x)){
    words <- x[[i]]
    out[[i]] <- words
    for (j in 2:n){
      phrases <- sapply(1:j, function(k){
        words[k:(length(words) - n + k)]
      })
      phrases <- apply(phrases, 1, paste, collapse=" ")
      out[[i]] <- c(out[[i]], phrases)
    }
  }
  return(out)
}

This works for finding the n-grams, but it seems inefficient. Replacing the for loops with *apply functions would still leave me with loops nested three deep:
result <- find_ngrams(dat, 2)
> result[[2]]
[1] "argentine" "crude" "oil"
[4] "production" "pct" "january"
[7] "mln" "barrels" "mln"
[10] "barrels" "january" "yacimientos"
[13] "petroliferos" "fiscales" "january"
[16] "natural" "gas" "output"
[19] "totalled" "billion" "cubic"
[22] "metrers" "pct" "billion"
[25] "cubic" "metres" "produced"
[28] "january" "yacimientos" "petroliferos"
[31] "fiscales" "added" "reuter"
[34] "argentine crude" "crude oil" "oil production"
[37] "production pct" "pct january" "january mln"
[40] "mln barrels" "barrels mln" "mln barrels"
[43] "barrels january" "january yacimientos" "yacimientos petroliferos"
[46] "petroliferos fiscales" "fiscales january" "january natural"
[49] "natural gas" "gas output" "output totalled"
[52] "totalled billion" "billion cubic" "cubic metrers"
[55] "metrers pct" "pct billion" "billion cubic"
[58] "cubic metres" "metres produced" "produced january"
[61] "january yacimientos" "yacimientos petroliferos" "petroliferos fiscales"
[64] "fiscales added" "added reuter"

Is there any significant part of this code that can be vectorized?

/edit: Here is an updated version of Matthew Plourde's function that produces "up to n-grams" and works on the whole list:
find_ngrams_base <- function(x, n) {
  if (n == 1) return(x)
  out <- lapply(1:n, function(n_i) embed(x, n_i))
  out <- sapply(out, function(y) apply(y, 1, function(row) paste(rev(row), collapse=' ')))
  unlist(out)
}

find_ngrams_plourde <- function(x, ...){
  lapply(x, find_ngrams_base, ...)
}
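As a quick sanity check on a toy vector (my own example, not from the original post), the base function should return all unigrams followed by all bigrams:

```r
# Definition repeated from above so the snippet runs standalone.
find_ngrams_base <- function(x, n) {
  if (n == 1) return(x)
  out <- lapply(1:n, function(n_i) embed(x, n_i))
  out <- sapply(out, function(y) apply(y, 1, function(row) paste(rev(row), collapse = ' ')))
  unlist(out)
}

find_ngrams_base(c("a", "b", "c"), 2)
# [1] "a"   "b"   "c"   "a b" "b c"
```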

We can benchmark it against the function I wrote, and it turns out to be somewhat slower:
library(rbenchmark)
benchmark(
  replications=100,
  a <- find_ngrams(dat, 2),
  b <- find_ngrams(dat, 3),
  c <- find_ngrams(dat, 4),
  d <- find_ngrams(dat, 10),
  w <- find_ngrams_plourde(dat, 2),
  x <- find_ngrams_plourde(dat, 3),
  y <- find_ngrams_plourde(dat, 4),
  z <- find_ngrams_plourde(dat, 10),
  columns=c('test', 'elapsed', 'relative'),
  order='relative'
)
test elapsed relative
1 a <- find_ngrams(dat, 2) 0.040 1.000
2 b <- find_ngrams(dat, 3) 0.081 2.025
3 c <- find_ngrams(dat, 4) 0.117 2.925
5 w <- find_ngrams_plourde(dat, 2) 0.144 3.600
6 x <- find_ngrams_plourde(dat, 3) 0.212 5.300
7 y <- find_ngrams_plourde(dat, 4) 0.277 6.925
4 d <- find_ngrams(dat, 10) 0.361 9.025
8 z <- find_ngrams_plourde(dat, 10) 0.669 16.725

However, this also revealed that my function misses a lot of n-grams (oops):
for (i in 1:length(dat)){
  print(setdiff(w[[i]], a[[i]]))
  print(setdiff(x[[i]], b[[i]]))
  print(setdiff(y[[i]], c[[i]]))
  print(setdiff(z[[i]], d[[i]]))
}
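The missing n-grams appear to come from the inner index of my original function: `words[k:(length(words) - n + k)]` uses `n` where the j-gram pass needs `j`, so for every j < n the last n - j j-grams are dropped. A toy reproduction (my own example, not from the original post):

```r
# Original (buggy) function, repeated so the snippet runs standalone.
find_ngrams <- function(x, n){
  if (n == 1) return(x)
  out <- as.list(rep(NA, length(x)))
  for (i in 1:length(x)){
    words <- x[[i]]
    out[[i]] <- words
    for (j in 2:n){
      phrases <- sapply(1:j, function(k){
        words[k:(length(words) - n + k)]  # bug: n should be j here
      })
      phrases <- apply(phrases, 1, paste, collapse = " ")
      out[[i]] <- c(out[[i]], phrases)
    }
  }
  return(out)
}

find_ngrams(list(c("a", "b", "c", "d")), 3)[[1]]
# [1] "a"     "b"     "c"     "d"     "a b"   "b c"   "a b c" "b c d"
# The bigram "c d" is missing.
```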

I feel like both functions can still be improved, but I can't think of any way to avoid the triple loop (loop over the vectors, loop over the required n-gram sizes 1..n, loop over the words to build the n-grams).

/edit 2: Here is a modified function, based on Matt's answer:
find_ngrams_2 <- function(x, n){
  if (n == 1) return(x)
  lapply(x, function(y) {
    grams <- lapply(2:n, function(n_i) {
      do.call(paste, unname(rev(data.frame(embed(y, n_i), stringsAsFactors=FALSE))))
    })
    c(y, unlist(grams))
  })
}
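On the same kind of toy input (my own example, not from the original post), this version includes the final bigram that the original function dropped:

```r
# Definition repeated from above so the snippet runs standalone.
find_ngrams_2 <- function(x, n){
  if (n == 1) return(x)
  lapply(x, function(y) {
    grams <- lapply(2:n, function(n_i) {
      do.call(paste, unname(rev(data.frame(embed(y, n_i), stringsAsFactors = FALSE))))
    })
    c(y, unlist(grams))
  })
}

find_ngrams_2(list(c("a", "b", "c", "d")), 3)[[1]]
# [1] "a"     "b"     "c"     "d"     "a b"   "b c"   "c d"   "a b c" "b c d"
```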

It appears to return the correct list of n-grams, and it is faster than my original function (in most cases):
library(rbenchmark)
benchmark(
  replications=100,
  a <- find_ngrams(dat, 2),
  b <- find_ngrams(dat, 3),
  c <- find_ngrams(dat, 4),
  d <- find_ngrams(dat, 10),
  w <- find_ngrams_2(dat, 2),
  x <- find_ngrams_2(dat, 3),
  y <- find_ngrams_2(dat, 4),
  z <- find_ngrams_2(dat, 10),
  columns=c('test', 'elapsed', 'relative'),
  order='relative'
)

test elapsed relative
5 w <- find_ngrams_2(dat, 2) 0.039 1.000
1 a <- find_ngrams(dat, 2) 0.041 1.051
6 x <- find_ngrams_2(dat, 3) 0.078 2.000
2 b <- find_ngrams(dat, 3) 0.081 2.077
7 y <- find_ngrams_2(dat, 4) 0.119 3.051
3 c <- find_ngrams(dat, 4) 0.123 3.154
4 d <- find_ngrams(dat, 10) 0.399 10.231
8 z <- find_ngrams_2(dat, 10) 0.436 11.179

Best Answer

Here is one approach, using embed.

find_ngrams <- function(x, n) {
  if (n == 1) return(x)
  c(x, apply(embed(x, n), 1, function(row) paste(rev(row), collapse=' ')))
}
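The trick is that embed(x, n) builds all the sliding windows in one call: row i holds x[i + n - 1], ..., x[i] in reverse order, which is why each row is reversed before pasting. A toy illustration (my own example, not from the original answer):

```r
embed(c("a", "b", "c", "d"), 2)
#      [,1] [,2]
# [1,] "b"  "a"
# [2,] "c"  "b"
# [3,] "d"  "c"

# Definition repeated from above so the snippet runs standalone.
find_ngrams <- function(x, n) {
  if (n == 1) return(x)
  c(x, apply(embed(x, n), 1, function(row) paste(rev(row), collapse = ' ')))
}

find_ngrams(c("a", "b", "c", "d"), 2)
# [1] "a"   "b"   "c"   "d"   "a b" "b c" "c d"
```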

There seems to be a bug in your function. Once you fix that, we can run a benchmark.

A similar question on converting a list of tokens to n-grams in R can be found on Stack Overflow: https://stackoverflow.com/questions/16489748/
