
r - Specifying different types of missing values (NA)

Reposted. Author: 行者123. Updated: 2023-12-03 23:17:34

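I'm interested in specifying types of missing values. My data contain different kinds of missingness, and I'm trying to code these values as missing in R, but I'm looking for a solution that still lets me tell them apart.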

Say I have some data that looks like this:

set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown", "Refused",
                              "Blue", "Red", "Green"), 20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10, replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08), digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10, replace = TRUE))
df
# a b f g
# 1 Unknown 2 0.78 M
# 2 Refused 2 0.87 M
# 3 Red 77 0.82 Y
# 4 Red 99 0.78 Y
# 5 Green 77 0.97 M
# 6 Green 3 0.99 K
# 7 Red 3 0.99 Y
# 8 Green 88 0.84 C
# 9 Unknown 99 1.08 M
# 10 Refused 99 0.81 C
# 11 Blue 2 0.78 M
# 12 Green 2 0.87 M
# 13 Blue 77 0.82 Y
# 14 Don't know/Not sure 99 0.78 Y
# 15 Unknown 77 0.97 M
# 16 Refused 3 0.99 K
# 17 Blue 3 0.99 Y
# 18 Green 88 0.84 C
# 19 Refused 99 1.08 M
# 20 Red 99 0.81 C

If I now make two tables, my missing values ("Don't know/Not sure", "Unknown", "Refused", and 77, 88, 99) are included as regular data:
table(df$a,df$g)
# C K M Y
# Blue 0 0 1 2
# Don't know/Not sure 0 0 0 1
# Green 2 1 2 0
# Red 1 0 0 3
# Refused 1 1 2 0
# Unknown 0 0 3 0


table(df$b,df$g)
# C K M Y
# 2 0 0 4 0
# 3 0 2 0 2
# 77 0 0 2 2
# 88 2 0 0 0
# 99 2 0 2 2

I now recode the three factor levels "Don't know/Not sure", "Unknown", and "Refused" to <NA>:
is.na(df$a) <- df$a %in% c("Don't know/Not sure", "Unknown", "Refused")

and drop the empty levels:
df$a <- factor(df$a) 
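
Calling factor() on an existing factor rebuilds it from the values actually present, which is why the unused levels disappear. A minimal sketch on a toy vector (the name `x` is made up for illustration and is not part of the question's data); `droplevels()` is the base R function dedicated to the same job:

```r
# Toy factor; `x` is a hypothetical example vector
x <- factor(c("Blue", "Refused", "Red", "Unknown"))

# Mark the non-substantive answers as NA; the levels themselves remain
is.na(x) <- x %in% c("Refused", "Unknown")
levels(x)   # still includes "Refused" and "Unknown"

# Either call rebuilds the factor with only the levels still in use
x1 <- factor(x)
x2 <- droplevels(x)
levels(x1)
identical(levels(x1), levels(x2))
```

Both forms leave the NA entries in place; only the level set shrinks.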

The same goes for the numeric values 77, 88, and 99:
is.na(df) <- df=="77"|df=="88"|df=="99"

table(df$a, df$g, useNA = "always")
# C K M Y <NA>
# Blue 0 0 1 2 0
# Green 2 1 2 0 0
# Red 1 0 0 3 0
# <NA> 1 1 5 1 0

table(df$b,df$g, useNA = "always")
# C K M Y <NA>
# 2 0 0 4 0 0
# 3 0 2 0 2 0
# <NA> 4 0 4 4 0

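Now the missing categories are recoded to NA, but they are all lumped together. Is there a way to recode something as missing while retaining the original value? I want R to treat "Don't know/Not sure", "Unknown", "Refused", 77, 88, and 99 as missing, but I still want to keep that information in the variable.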

Best answer

As far as I know, base R has no built-in way to handle different NA types. (Editor's note: it does: NA_integer_, NA_real_, NA_complex_, and NA_character_. See ?base::NA.)
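
For context on the editor's note, here is a small sketch of what the typed NA constants in base R do. They differ in storage type only, so they do not by themselves record why a value is missing, which is the distinction the question asks about. The vector name `x` is illustrative.

```r
# Each constant is a missing value of a different storage type
typeof(NA)             # "logical"
typeof(NA_integer_)    # "integer"
typeof(NA_real_)       # "double"
typeof(NA_complex_)    # "complex"
typeof(NA_character_)  # "character"

# Inside a vector they all simply test as missing; no "reason" is attached
x <- c(1.5, NA_real_, NA_integer_)
typeof(x)   # "double": the integer NA was coerced
is.na(x)    # FALSE TRUE TRUE
```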

One option is to use a package that does this, such as "memisc". It takes a bit of extra work, but it seems to do what you're looking for.

Here's an example:

First, your data. I've made a copy, since we'll be making some fairly significant changes to the dataset, and it's always nice to have a backup.

set.seed(667)
df <- data.frame(a = sample(c("Don't know/Not sure", "Unknown",
                              "Refused", "Blue", "Red", "Green"),
                            20, replace = TRUE),
                 b = sample(c(1, 2, 3, 77, 88, 99), 10,
                            replace = TRUE),
                 f = round(rnorm(n = 10, mean = .90, sd = .08),
                           digits = 2),
                 g = sample(c("C", "M", "Y", "K"), 10,
                            replace = TRUE))
df2 <- df

Let's convert the variable "a" to a factor:
df2$a <- factor(df2$a,
                levels = c("Blue", "Red", "Green",
                           "Don't know/Not sure",
                           "Refused", "Unknown"),
                labels = c(1, 2, 3, 77, 88, 99))
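
To make the `levels`/`labels` mapping concrete, here is a minimal sketch on a three-element toy vector (`ans` and `coded` are hypothetical names). It also shows why recovering the numeric codes goes through as.character() rather than as.integer():

```r
# Hypothetical three-answer vector, used only to show the mapping
ans <- c("Red", "Refused", "Blue")

coded <- factor(ans,
                levels = c("Blue", "Red", "Refused"),
                labels = c(1, 2, 88))

levels(coded)         # "1" "2" "88" (labels are stored as character)
as.character(coded)   # "2" "88" "1"
as.integer(coded)     # 2 3 1: internal level positions, not the labels
```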

Load the "memisc" library:
library(memisc)

Now, convert the variables "a" and "b" to an item in "memisc":
df2$a <- as.item(as.character(df2$a),
                 labels = structure(c(1, 2, 3, 77, 88, 99),
                                    names = c("Blue", "Red", "Green",
                                              "Don't know/Not sure",
                                              "Refused", "Unknown")),
                 missing.values = c(77, 88, 99))
df2$b <- as.item(df2$b,
                 labels = c(1, 2, 3, 77, 88, 99),
                 missing.values = c(77, 88, 99))

By doing this, we get a new data type. Compare the following:
as.factor(df2$a)
# [1] <NA> <NA> Red Red Green Green Red Green <NA> <NA> Blue
# [12] Green Blue <NA> <NA> <NA> Blue Green <NA> Red
# Levels: Blue Red Green
as.factor(include.missings(df2$a))
# [1] *Unknown *Refused Red
# [4] Red Green Green
# [7] Red Green *Unknown
# [10] *Refused Blue Green
# [13] Blue *Don't know/Not sure *Unknown
# [16] *Refused Blue Green
# [19] *Refused Red
# Levels: Blue Red Green *Don't know/Not sure *Refused *Unknown

We can use this information to create tables that behave the way you describe, while retaining all of the original information.
table(as.factor(include.missings(df2$a)), df2$g)
#
# C K M Y
# Blue 0 0 1 2
# Red 1 0 0 3
# Green 2 1 2 0
# *Don't know/Not sure 0 0 0 1
# *Refused 1 1 2 0
# *Unknown 0 0 3 0
table(as.factor(df2$a), df2$g)
#
# C K M Y
# Blue 0 0 1 2
# Red 1 0 0 3
# Green 2 1 2 0
table(as.factor(df2$a), df2$g, useNA="always")
#
# C K M Y <NA>
# Blue 0 0 1 2 0
# Red 1 0 0 3 0
# Green 2 1 2 0 0
# <NA> 1 1 5 1 0

Tables for numeric columns with missing data behave the same way.
table(as.factor(include.missings(df2$b)), df2$g)
#
# C K M Y
# 1 0 0 0 0
# 2 0 0 4 0
# 3 0 2 0 2
# *77 0 0 2 2
# *88 2 0 0 0
# *99 2 0 2 2
table(as.factor(df2$b), df2$g, useNA="always")
#
# C K M Y <NA>
# 1 0 0 0 0 0
# 2 0 0 4 0 0
# 3 0 2 0 2 0
# <NA> 4 0 4 4 0
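
If adding a package is not an option, a similar separation can be approximated in base R by storing the reason in a companion column before blanking out the codes. This is only a sketch of an alternative, not what the answer above does: the toy data frame `d`, the column name `b_reason`, and the labels attached to 77, 88, and 99 are all assumptions made for illustration.

```r
# Hypothetical toy data; `d` and `b_reason` are illustrative names
d <- data.frame(b = c(2, 77, 3, 99, 88, 2))

# Lookup from missing code to the reason it stands for (labels are assumed)
reasons <- c("77" = "Don't know/Not sure", "88" = "Refused", "99" = "Unknown")

# Record the reason first, then blank out the codes in the analysis column
d$b_reason <- unname(reasons[as.character(d$b)])
is.na(d$b) <- !is.na(d$b_reason)

mean(d$b, na.rm = TRUE)              # uses only the valid values 2, 3, 2
table(d$b_reason, useNA = "ifany")   # the reasons stay distinguishable
```

The cost of this design is that the two columns have to be kept in sync by hand, which is the bookkeeping that an item type such as the one in "memisc" handles for you.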

As a bonus, you get tools for generating nice codebooks:
> codebook(df2$a)
========================================================================

df2$a

------------------------------------------------------------------------

Storage mode: character
Measurement: nominal
Missing values: 77, 88, 99

   Values and labels                N    Percent

    1   'Blue'                      3   25.0  15.0
    2   'Red'                       4   33.3  20.0
    3   'Green'                     5   41.7  25.0
   77 M 'Don't know/Not sure'       1          5.0
   88 M 'Refused'                   4         20.0
   99 M 'Unknown'                   3         15.0

However, I'd also recommend reading the comment from @Maxim.K about what really constitutes a missing value.

This question, "r - Specifying different types of missing values (NA)", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/16074384/
