
r - String splitting a data.table column produces NA

Reposted. Author: 行者123. Updated: 2023-12-04 10:06:58

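This is my first question on SO, so let me know if it can be improved. I am working on a natural language processing project in R and am trying to build a data.table that contains test cases. Here is a very simple example: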

library(data.table)

texts.dt <- data.table(string = c("one",
                                  "two words",
                                  "three words here",
                                  "four useless words here",
                                  "five useless meaningless words here",
                                  "six useless meaningless words here just",
                                  "seven useless meaningless words here just to",
                                  "eigth useless meaningless words here just to fill",
                                  "nine useless meaningless words here just to fill up",
                                  "ten useless meaningless words here just to fill up space"),
                       word.count = 1:10,
                       stop.at.word = c(0, 1, 2, 2, 4, 3, 3, 6, 7, 5))

This returns the data.table we will be working with:
    string                                                   word.count stop.at.word
 1: one                                                               1            0
 2: two words                                                         2            1
 3: three words here                                                  3            2
 4: four useless words here                                           4            2
 5: five useless meaningless words here                               5            4
 6: six useless meaningless words here just                           6            3
 7: seven useless meaningless words here just to                      7            3
 8: eigth useless meaningless words here just to fill                 8            6
 9: nine useless meaningless words here just to fill up               9            7
10: ten useless meaningless words here just to fill up space         10            5

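In the real application, the values in the stop.at.word column are determined randomly (upper bound = word.count - 1). Also, the strings are not sorted by length, but that should make no difference.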

The code should add two columns, input and output, where input contains the substring from the first word through word number stop.at.word, and output contains the word that follows (a single word), like this:
> desired_result
    string                                                   word.count stop.at.word input
 1: one                                                               1            0
 2: two words                                                         2            1 two
 3: three words here                                                  3            2 three words
 4: four useless words here                                           4            2 four useless
 5: five useless meaningless words here                               5            4 five useless meaningless words
 6: six useless meaningless words here just                           6            3 six useless meaningless
 7: seven useless meaningless words here just to                      7            3 seven useless meaningless
 8: eigth useless meaningless words here just to fill                 8            6 eigth useless meaningless words here just
 9: nine useless meaningless words here just to fill up               9            7 nine useless meaningless words here just to
10: ten useless meaningless words here just to fill up space         10            5 ten useless meaningless words here
    output
 1:
 2: words
 3: here
 4: words
 5: here
 6: words
 7: words
 8: to
 9: fill
10: just

Unfortunately, what I get is:
    string                                                   word.count stop.at.word input output
 1: one                                                               1            0
 2: two words                                                         2            1    NA     NA
 3: three words here                                                  3            2    NA     NA
 4: four useless words here                                           4            2    NA     NA
 5: five useless meaningless words here                               5            4    NA     NA
 6: six useless meaningless words here just                           6            3    NA     NA
 7: seven useless meaningless words here just to                      7            3    NA     NA
 8: eigth useless meaningless words here just to fill                 8            6    NA     NA
 9: nine useless meaningless words here just to fill up               9            7    NA     NA
10: ten useless meaningless words here just to fill up space         10            5   ten     NA

Note the inconsistent results: row 1 holds empty strings, and row 10 returns "ten".

Here is the code I am using:
texts.dt[, c("input", "output") := .(
  substr(string,
         1,
         sapply(gregexpr(" ", string), "[", stop.at.word) - 1),
  substr(string,
         sapply(gregexpr(" ", string), "[", stop.at.word),
         sapply(gregexpr(" ", string), "[", stop.at.word + 1) - 1)
)]

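I have run many tests: the substr call works fine when I try it on a single string in the console, but it fails when applied to the data.table.
I suspect I am missing something scope-related in data.table, but I have not used this package for long, so I am rather confused.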

I would greatly appreciate some help.
Thanks in advance!

Best answer

I would probably do

texts.dt[stop.at.word > 0, c("input","output") := {
  sp = strsplit(string, " ")
  list(
    mapply(function(p,n) paste(p[seq_len(n)], collapse = " "), sp, stop.at.word),
    mapply(`[`, sp, stop.at.word+1L)
  )
}]

# partial result
head(texts.dt, 4)

   string                  word.count stop.at.word input        output
1: one                              1            0 NA           NA
2: two words                        2            1 two          words
3: three words here                 3            2 three words  here
4: four useless words here          4            2 four useless words
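Alternatively:
library(stringi)
texts.dt[stop.at.word > 0, c("input","output") := {
  # one pattern per row: the first stop.at.word words, a space, then the next single word
  patt = paste0("((\\w+ ){", stop.at.word-1, "}\\w+) (\\w+)")
  m = stri_match(string, regex = patt)
  # column 1 is the full match; columns 2 and 4 are capture groups 1 and 3
  list(m[, 2], m[, 4])
}]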
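
To see what the per-row pattern does, here is a minimal sketch that swaps stringi for base R's regexec and regmatches, purely so it runs without extra packages. The helper capture and the variables strings, stops and prefix are hypothetical names introduced for this illustration. It also shows that a final group written as (.*) captures everything after the input words, whereas (\\w+) captures only the next word.

```r
# Two rows taken from the question's example data.
strings <- c("four useless words here", "five useless meaningless words here")
stops   <- c(2, 4)

# Hypothetical helper: match string i against pattern i and return one capture group.
# regexec() accepts a single pattern, so mapply pairs strings and patterns up.
capture <- function(s, p, group) {
  unname(mapply(function(si, pat) {
    regmatches(si, regexec(pat, si, perl = TRUE))[[1]][group + 1]
  }, s, p))
}

# (stops - 1) repetitions of "word + space", one more word, then a space.
prefix <- paste0("((\\w+ ){", stops - 1, "}\\w+) ")

input     <- capture(strings, paste0(prefix, "(.*)"), 1)
remainder <- capture(strings, paste0(prefix, "(.*)"), 3)
next_word <- capture(strings, paste0(prefix, "(\\w+)"), 3)

input      # "four useless" "five useless meaningless words"
remainder  # "words here"   "here"
next_word  # "words"        "here"
```

Group 2 is the inner repeated group and is only a by-product of the repetition, which is why the result is read from groups 1 and 3.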

Regarding "r - String splitting a data.table column produces NA", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/36651032/
