r - Exclude rows from a data.table based on sort order

Reposted. Author: 行者123. Updated: 2023-12-02 04:19:00

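I need some help filtering a data.table in R. I have a file containing millions of rows, each with 4 words.

I want to remove some unwanted rows. Each row has 4 words and a frequency.

For each combination of the first 3 words, I want to keep only the 3 rows with the highest frequency.

Below is a sample data.table and the output I need.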

library(data.table)

text <- c("Run to the hills", "Run to the mountains", "Run to the highway", "Run to the top", "Run to the horizon",
"Go away with him", "Go away with her",
"I am a good", "I am a bad", "I am a uggly", "I am a guy", "I am a woman",
"I am the most")

frequency <- c(0.1, 0.09, 0.2, 0.05, 0.001,
0.05, 0.04,
0.1, 0.06, 0.3, 0.05, 0.1,
0.2)

DT <- data.table(text = text, frequency = frequency)

#Original output:
text frequency
1: Run to the hills 0.100
2: Run to the mountains 0.090
3: Run to the highway 0.200
4: Run to the top 0.050
5: Run to the horizon 0.001
6: Go away with him 0.050
7: Go away with her 0.040
8: I am a good 0.100
9: I am a bad 0.060
10: I am a uggly 0.300
11: I am a guy 0.050
12: I am a woman 0.100
13: I am the most 0.200

Desired output (only the top 3 frequencies among rows that share the same first 3 words):

                 text frequency
1: Go away with him 0.05
2: Go away with her 0.04
3: I am a uggly 0.30
4: I am a woman 0.10
5: I am a good 0.10
6: I am the most 0.20
7: Run to the highway 0.20
8: Run to the hills 0.10
9: Run to the mountains 0.09

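So I want to keep only the top 3 rows, ordered by the frequency column, for each of: "Run to the XXXXX", "Go away with XXXXX", "I am a XXXXX", "I am the XXXXX".

In this case I would drop: "Run to the top", "Run to the horizon", "I am a bad", "I am a guy".

I was thinking of using regular expressions, but I'm a bit lost right now :-\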

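Best answer

You can use sub() to create an id column made up of the first three words, then use that column to take the top three frequencies within each id.

It's easier done than said...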
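To make the id step concrete, here is a minimal base R sketch of what the sub() call does to a few strings from the question. The variable names `examples` and `ids` are only illustrative and do not appear in the original answer.

```r
# Illustrative input: a few strings taken from the question's sample data
examples <- c("Run to the hills", "Go away with him", "I am the most")

# " \\S+$" matches one space followed by the final run of non-space
# characters, so sub() removes the last word and keeps the first three
ids <- sub(" \\S+$", "", examples)

# ids is now c("Run to the", "Go away with", "I am the")
print(ids)
```

Because the pattern is anchored at the end of the string, it only ever removes the last word, which is why this works for rows of exactly four words.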

library(data.table)

## add an id column containing only the first three words
DT[, id := sub(" \\S+$", "", text)]
## order by frequency, take the top three by id, remove id and NAs
## and with a little help from Frank :)
na.omit(
DT[order(frequency, decreasing = TRUE), .SD[1:3], keyby = id][, id := NULL][]
)
# text frequency
# 1: Go away with him 0.05
# 2: Go away with her 0.04
# 3: I am a uggly 0.30
# 4: I am a good 0.10
# 5: I am a woman 0.10
# 6: I am the most 0.20
# 7: Run to the highway 0.20
# 8: Run to the hills 0.10
# 9: Run to the mountains 0.09
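
As a possible variation (not part of the original answer), head(.SD, 3) can replace .SD[1:3]. head() returns at most three rows and does not pad short groups with NA, so the na.omit() wrapper should no longer be needed. The sketch below uses a reduced copy of the sample data and a hypothetical result name `top3`.

```r
library(data.table)

# Reduced sample: one group with four rows, one group with only two
DT <- data.table(
  text = c("Run to the hills", "Run to the top", "Run to the highway",
           "Run to the mountains", "Go away with him", "Go away with her"),
  frequency = c(0.1, 0.05, 0.2, 0.09, 0.05, 0.04)
)

# Group key: everything except the last word
DT[, id := sub(" \\S+$", "", text)]

# Sort by descending frequency, keep at most three rows per id,
# then drop the helper column from the result
top3 <- DT[order(-frequency), head(.SD, 3), keyby = id][, id := NULL]

# top3 holds 5 rows: both "Go away with" rows and the three
# highest-frequency "Run to the" rows; "Run to the top" is dropped
print(top3[])
```

Note that the helper column id remains in DT itself, since := adds it by reference; only the result has it removed.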

Regarding "r - Exclude rows from a data.table based on sort order", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/31800256/
