
r - Subsetting data row-wise based on the length of ngrams

Reposted · Author: 行者123 · Updated: 2023-12-01 10:30:47

I have a data frame containing many terms (n-grams of varying size, up to 5-grams) and their respective frequencies:

df = data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
"a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

This gives us:

                 term freq
1                   a  131
2                 a a   13
3            a a card    3
4       a a card base    2
5    a a card base ne    1
6         a a divorce    1
7  a a divorce lawyer    1
8                  be   72
9              be the   17
10         be the one    5

What I want is to separate the unigrams (terms with only one word), bigrams (terms with two words), trigrams, four-grams, and five-grams into different data frames.

For example, "df1", containing only the unigrams, would look like:

  term freq
1    a  131
2   be   72

"df2" (bigrams):

    term freq
1    a a   13
2 be the   17

"df3" (trigrams):

         term freq
1    a a card    3
2 a a divorce    1
3  be the one    5

And so on. Any ideas? Perhaps a regex?

Best Answer

You can split on the number of spaces, i.e.

split(df, stringr::str_count(df$term, '\\s+'))

#$`0`
#  term freq
#1    a  131
#8   be   72

#$`1`
#    term freq
#2    a a   13
#9 be the   17

#$`2`
#           term freq
#3      a a card    3
#6   a a divorce    1
#10   be the one    5

#$`3`
#                term freq
#4      a a card base    2
#7 a a divorce lawyer    1

#$`4`
#              term freq
#5 a a card base ne    1
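If you want each piece as its own variable (df1 for unigrams, df2 for bigrams, and so on, as in the question), one option is to name the elements of the split result and push them into the global environment with list2env. A minimal base-R sketch, using the df from the question:

```r
df <- data.frame(term = c("a", "a a", "a a card", "a a card base", "a a card base ne",
                          "a a divorce", "a a divorce lawyer", "be", "be the", "be the one"),
                 freq = c(131, 13, 3, 2, 1, 1, 1, 72, 17, 5))

# Word count per term: number of non-whitespace runs
n_words <- lengths(gregexpr("\\S+", df$term))

# One data frame per n-gram size, named df1 ... df5
dfs <- split(df, n_words)
names(dfs) <- paste0("df", names(dfs))
list2env(dfs, envir = .GlobalEnv)

df1  # unigrams: "a" and "be"
```

Whether promoting list elements to top-level variables is a good idea is debatable; keeping them in the named list dfs is often easier to work with.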

Or a fully base-R solution (as @akrun mentioned):

split(df, lengths(gregexpr("\\S+", df$term)))
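For completeness, a sketch of an equivalent base-R variant that counts the words with strsplit instead of gregexpr (splitting each term on runs of whitespace should yield the same grouping):

```r
df <- data.frame(term = c("a", "a a", "be the", "be the one"),
                 freq = c(131, 13, 17, 5))

# lengths(strsplit(...)) = number of whitespace-separated words per term
by_size <- split(df, lengths(strsplit(df$term, "\\s+")))
by_size
```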

Regarding "r - Subsetting data row-wise based on the length of ngrams", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42947893/
