
r - One-hot encoding of the top categories and NA, with the rest lumped into 'others', in R

Reposted. Author: 行者123. Updated: 2023-12-04 09:36:15

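I want to one-hot encode a variable only for its top categories, NA, and "others".

So in this simplified example, one-hot encode b for the categories with freq > 1, plus NA: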

id <- c(1, 2, 3, 4, 5, 6)
b <- c(NA, "A", "C", "A", "B", "C")
c <- c(2, 3, 6, NA, 4, 7)
df <- data.frame(id, b, c)

  id    b  c
1  1 <NA>  2
2  2    A  3
3  3    C  6
4  4    A NA
5  5    B  4
6  6    C  7

table <- as.data.frame(table(df$b))

  Var1 Freq
1    A    2
2    B    1
3    C    2

table_top <- table[table$Freq > 1,]

  Var1 Freq
1    A    2
3    C    2

Now I would like to get something like this:

id b_NA  c b_A b_C b_Others
 1    1  2   0   0        0
 2    0  3   1   0        0
 3    0  6   0   1        0
 4    0 NA   1   0        0
 5    0  4   0   0        1
 6    0  7   0   1        0

I tried subsetting df:

table_top <- as.vector(table_top$Var1)
table_only_top <- subset(df, b %in% table_top)
table_only_top

  id b  c
2  2 A  3
3  3 C  6
4  4 A NA
6  6 C  7

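But now I don't know how to get to the desired output. My real data has many more categories than this example, so I cannot hard-code the category names from the output. In my real data, many categories also fall under "others".

Any hints are much appreciated :)

Best answer

Quick and sexy with data.table and mltools: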

> one_hot(dt, naCols = TRUE, sparsifyNAs = TRUE)

   id cat_NA cat_A cat_C cat_Others freq
1:  1      1     0     0          0    2
2:  2      0     1     0          0    3
3:  3      0     0     1          0    6
4:  4      0     1     0          0   NA
5:  5      0     0     0          1    4
6:  6      0     0     1          0    7
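
For readers who cannot install mltools, the expansion step can be approximated in base R. This is only a minimal sketch, not the answer's method: the helper name one_hot_base and its prefix argument are made up for illustration, and it assumes the rare categories have already been relabelled as "Others". Like sparsifyNAs = TRUE, it puts 0 in every level column for rows where the value is NA.

```r
# Hypothetical helper (not part of mltools): builds one 0/1 column per
# observed level, plus an NA indicator column, from a character vector.
one_hot_base <- function(x, prefix) {
  x <- as.character(x)
  # Start with the NA indicator so it exists even when no level is present
  out <- data.frame(as.integer(is.na(x)))
  names(out) <- paste0(prefix, "_NA")
  # Add one indicator column per non-NA level, in sorted order
  for (lev in sort(unique(x[!is.na(x)]))) {
    out[[paste0(prefix, "_", lev)]] <- as.integer(!is.na(x) & x == lev)
  }
  out
}

# Assumed input: the cat column after rare levels were lumped into "Others"
lumped <- c(NA, "A", "C", "A", "Others", "C")
encoded <- one_hot_base(lumped, prefix = "cat")
print(encoded)
```

Because the column names are derived from the values themselves, nothing has to be hard-coded when the real data has many more categories. The result can be attached to the remaining columns with cbind().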

Code

Load libraries
library(dplyr)
library(data.table)
library(mltools)
Transform data
# Lump every category that occurs once or less into "Others"
df <- df %>%
  # Group by the variable that will be one-hot encoded
  group_by(cat) %>%
  # Add a per-group count column
  mutate(count = n()) %>%
  # Ungroup for the next steps
  ungroup() %>%
  # Relabel every non-NA category with a count of 1 or below as "Others".
  # If cat were a factor, ifelse() would return its integer codes here.
  mutate(cat = ifelse(!is.na(cat) & count <= 1, "Others", cat),
         # Only now turn it into a factor for the one_hot function
         cat = as.factor(cat)) %>%
  # Drop the count column
  select(id, cat, freq)

# Turn into data.table
dt <- as.data.table(df)
Inspect the intermediate result
> dt
   id    cat freq
1:  1   <NA>    2
2:  2      A    3
3:  3      C    6
4:  4      A   NA
5:  5 Others    4
6:  6      C    7
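
The lumping step that produces this intermediate result does not strictly need dplyr. The following is a rough base-R sketch under stated assumptions: the helper lump_rare and its arguments min_count and other are hypothetical names chosen for illustration, and min_count = 2 mirrors the "count <= 1" rule used above.

```r
# Hypothetical helper: relabel values that occur fewer than `min_count`
# times as `other`, leaving NA untouched.
lump_rare <- function(x, min_count = 2, other = "Others") {
  x <- as.character(x)
  counts <- table(x)                      # table() ignores NA by default
  rare <- names(counts)[counts < min_count]
  x[x %in% rare] <- other                 # NA never matches, so it stays NA
  x
}

cat_raw <- c(NA, "A", "C", "A", "B", "C")
lumped <- lump_rare(cat_raw)
print(lumped)
```

Working on a character vector sidesteps the factor-to-integer pitfall mentioned in the comments above; the result can be converted with as.factor() right before calling one_hot.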

Data

id <- c(1, 2, 3, 4, 5, 6)
cat <- c(NA, "A", "C", "A", "B", "C")
freq <- c(2, 3, 6, NA, 4, 7)
# It is important to have no factor variables other than
# the variable(s) you want to one-hot encode. For that reason
# the automatic conversion to factors is turned off.
df <- data.frame(id, cat, freq,
                 stringsAsFactors = FALSE)

> df
  id  cat freq
1  1 <NA>    2
2  2    A    3
3  3    C    6
4  4    A   NA
5  5    B    4
6  6    C    7

Regarding "r - One-hot encoding of the top categories and NA, with the rest lumped into 'others', in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52906496/
