
r - How to one-hot encode factor variables with data.table?

Reposted. Author: 行者123. Updated: 2023-12-04 02:51:45

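For those unfamiliar, one-hot encoding means converting a column of categories (i.e. a factor) into multiple columns of binary indicator variables, where each new column corresponds to one class of the original column. An example makes this clearer: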

library(data.table)

dt <- data.table(
  ID=1:5,
  Color=factor(c("green", "red", "red", "blue", "green"), levels=c("blue", "green", "red", "purple")),
  Shape=factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

dt
   ID Color    Shape
1:  1 green   square
2:  2   red triangle
3:  3   red   square
4:  4  blue triangle
5:  5 green   cirlce

# one hot encode the colors
color.binarized <- dcast(dt[, list(V1=1, ID, Color)], ID ~ Color, fun=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Color_ in front of each one-hot-encoded feature
setnames(color.binarized, setdiff(colnames(color.binarized), "ID"), paste0("Color_", setdiff(colnames(color.binarized), "ID")))

# one hot encode the shapes
shape.binarized <- dcast(dt[, list(V1=1, ID, Shape)], ID ~ Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))

# Prepend Shape_ in front of each one-hot-encoded feature
setnames(shape.binarized, setdiff(colnames(shape.binarized), "ID"), paste0("Shape_", setdiff(colnames(shape.binarized), "ID")))

# Join one-hot tables with original dataset
dt <- dt[color.binarized, on="ID"]
dt <- dt[shape.binarized, on="ID"]

dt
   ID Color    Shape Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
1:  1 green   square          0           1         0            0            0            1              0
2:  2   red triangle          0           0         1            0            0            0              1
3:  3   red   square          0           0         1            0            0            1              0
4:  4  blue triangle          1           0         0            0            0            0              1
5:  5 green   cirlce          0           1         0            0            1            0              0

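This is something I do often, and as you can see it is quite tedious (especially when my data has many factor columns). Is there an easier way to do this with data.table? In particular, I assumed dcast would let me one-hot encode several columns at once, but when I try something like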
dcast(dt[, list(V1=1, ID, Color, Shape)], ID ~ Color + Shape, fun=sum, value.var="V1", drop=c(TRUE, FALSE))

I get one column per combination of Color and Shape instead:
   ID blue_cirlce blue_square blue_triangle green_cirlce green_square green_triangle red_cirlce red_square red_triangle purple_cirlce purple_square purple_triangle
1:  1           0           0             0            0            1              0          0          0            0             0             0               0
2:  2           0           0             0            0            0              0          0          0            1             0             0               0
3:  3           0           0             0            0            0              0          0          1            0             0             0               0
4:  4           0           0             1            0            0              0          0          0            0             0             0               0
5:  5           0           0             0            1            0              0          0          0            0             0             0               0

Best answer

Here you go (using dt as originally defined, before the one-hot columns were joined onto it):

dcast(melt(dt, id.vars='ID'), ID ~ variable + value, fun = length)
#   ID Color_blue Color_green Color_red Shape_cirlce Shape_square Shape_triangle
#1:  1          0           1         0            0            1              0
#2:  2          0           0         1            0            0              1
#3:  3          0           0         1            0            1              0
#4:  4          1           0         0            0            0              1
#5:  5          0           1         0            1            0              0
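
The one-liner does two things, which may be easier to follow when run separately. The sketch below rebuilds the question's dt and keeps the intermediate table. The names long and wide are illustrative only and do not appear in the original answer.

```r
library(data.table)

# Same toy data as in the question (the "cirlce" spelling is kept so the
# column names match the output shown above).
dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)

# Step 1: reshape to long format, one row per (ID, original column) pair.
# The result has the columns ID, variable and value.
long <- melt(dt, id.vars = "ID")

# Step 2: spread back to wide format with one column per observed
# (variable, value) pair. length() counts the rows that fall in each cell,
# which is 1 where the pair occurs for that ID and 0 where it does not.
wide <- dcast(long, ID ~ variable + value, fun.aggregate = length)
print(wide)
```

Because only observed pairs become columns, the unused level "purple" does not appear in wide.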

To also get columns for the unused factor levels, you can do the following:
res = dcast(melt(dt, id.vars = 'ID', value.factor = TRUE), ID ~ value, drop = FALSE, fun = length)
# rename each indicator column to <original column>_<level>, in level order
setnames(res, c("ID", unlist(lapply(2:ncol(dt),
    function(i) paste(names(dt)[i], levels(dt[[i]]), sep = "_")))))
res
#   ID Color_blue Color_green Color_red Color_purple Shape_cirlce Shape_square Shape_triangle
#1:  1          0           1         0            0            0            1              0
#2:  2          0           0         1            0            0            0              1
#3:  3          0           0         1            0            0            1              0
#4:  4          1           0         0            0            0            0              1
#5:  5          0           1         0            0            1            0              0
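
If this comes up often, the idea can be wrapped in a small function. The following is only a sketch under stated assumptions. one_hot, id_col and cols are hypothetical names (not part of data.table). The selected columns are assumed to be factors without NA values, and id_col is assumed to identify rows uniquely. Unlike the snippet above, it builds the "Column_level" labels before casting, so no renaming step is needed afterwards.

```r
library(data.table)

# Hypothetical helper, not part of data.table: one-hot encode the factor
# columns named in `cols`, keeping unused levels as all-zero columns.
# Assumes no NA values in `cols` and one row per value of `id_col`.
one_hot <- function(dt, id_col, cols) {
  # Long table with one row per (id, column) pair and a "Column_level" label.
  long <- rbindlist(lapply(cols, function(col) {
    data.table(id = dt[[id_col]],
               label = paste(col, as.character(dt[[col]]), sep = "_"))
  }))
  # Declare every possible label as a factor level so that drop = FALSE
  # emits a column even for levels that never occur in the data.
  all_labels <- unlist(lapply(cols, function(col) {
    paste(col, levels(dt[[col]]), sep = "_")
  }))
  long[, label := factor(label, levels = all_labels)]
  wide <- dcast(long, id ~ label, value.var = "label",
                fun.aggregate = length, drop = FALSE)
  setnames(wide, "id", id_col)
  # Attach the indicator columns to the original table.
  merge(dt, wide, by = id_col)
}

dt <- data.table(
  ID = 1:5,
  Color = factor(c("green", "red", "red", "blue", "green"),
                 levels = c("blue", "green", "red", "purple")),
  Shape = factor(c("square", "triangle", "square", "triangle", "cirlce"))
)
res <- one_hot(dt, "ID", c("Color", "Shape"))
print(res)
```

Passing the column names explicitly keeps numeric or character columns out of the encoding. A stricter version might check is.factor() on each entry of cols before proceeding.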

Regarding "r - How to one-hot encode factor variables with data.table?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39905820/
