
Cleaning up factor levels (collapsing multiple levels/labels)

Reposted. Author: bug小助手. Updated: 2023-10-24 21:28:30



What is the most effective (i.e., efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how can two or more factor levels be combined into one?



Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":




## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

## expectedOutput
[1] Yes Yes Yes No No <NA>
Levels: Yes No # <~~ NOTICE ONLY **TWO** LEVELS





One option is, of course, to clean the strings beforehand using sub and friends.
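As a rough sketch of that string-cleaning route (the anchored patterns and the names cleaned and result are just illustrative choices, not something from the original post):

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Anchor the patterns so only the exact strings "Y" and "N" are rewritten
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Anything that is still not a valid value becomes NA
cleaned[!cleaned %in% c("Yes", "No")] <- NA

result <- factor(cleaned, levels = c("Yes", "No"))
result
```

This works, but every new spelling needs its own pattern, which is why the question asks for something more direct.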



Another method is to allow duplicate labels, then drop them:



## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)
droplevels(x.f)


However, is there a more effective way?







While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what happens. Needless to say, none of the following got me any closer to my goal.



factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Comments

Haven't tested this yet, but the R 3.5.0 (2018-04-23) release notes say "factor(x, levels, labels) now allows duplicated labels (not duplicated levels!). Hence you can map different values of x to the same level directly."


Recommended answers

UPDATE 2: See Uwe's answer which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.




UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.




ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST:
There is a little-known option to pass a named list to the levels function for exactly this purpose. The names of the list should be the desired names of the levels, and the elements should be the current level names that are to be renamed. Some (including the OP; see Ricardo's comment on Tim's answer) prefer this for ease of reading.



x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes Yes Yes No No <NA> <NA>
## Levels: Yes No


This is mentioned in the levels documentation; see also the examples there.




value: For the 'factor' method, a
vector of character strings with length at least the number
of levels of 'x', or a named list specifying how to rename
the levels.




This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here: https://stackoverflow.com/a/10491881/210673.



> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes Yes Yes No No <NA>
Levels: Yes No
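To see why the functional form is handy, here is a small sketch (the names f and g are mine): calling `levels<-` as an ordinary function returns a modified copy, so the original factor is left untouched.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
f <- factor(x)

# Call the replacement function directly; f itself is not modified
g <- `levels<-`(f, list(Yes = c("Y", "Yes"), No = c("N", "No")))

levels(f)  # still the five original levels
levels(g)  # the two collapsed levels, in the order given in the list
```

The ordinary assignment form, levels(f) <- list(...), does the same thing but overwrites f.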


As the question is titled Cleaning up factor levels (collapsing multiple levels/labels), the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.




There are several convenience functions available for cleaning up factor levels:




x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)


Collapse factor levels into manually defined groups



fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes Yes Yes No No <NA>
#Levels: No Yes


Change factor levels by hand



fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes Yes Yes No No <NA>
#Levels: No Yes


Automatically relabel factor levels, collapse as necessary



fun <- function(z) {
z[z == "Y"] <- "Yes"
z[z == "N"] <- "No"
z[!(z %in% c("Yes", "No"))] <- NA
z
}
fct_relabel(factor(x), fun)
#[1] Yes Yes Yes No No <NA>
#Levels: No Yes


Note that fct_relabel() works on factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.



Reorder factor levels by first appearance



The expected output given by the OP is




[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No


Here the levels are ordered as they appear in x, which differs from the default (see ?factor: "The levels of a factor are by default sorted").
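A quick base R illustration of that default, using a made-up vector (the names default_f and inorder_f are only for this sketch):

```r
x <- c("Yes", "Yes", "No", "No")

# Default: levels are the sorted unique values
default_f <- factor(x)
levels(default_f)

# Passing unique(x) as levels keeps the order of first appearance
inorder_f <- factor(x, levels = unique(x))
levels(inorder_f)
```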



To match the expected output, apply fct_inorder() before collapsing the levels:



fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")


Both now return the expected output, with the levels in the same order.



Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:




x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes Yes Yes No No <NA>
# Levels: Yes No


"1 line, maps multiple values to the same level, sets NA for missing levels" – h/t @Aaron
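If you like the readability of a named list but still want the single factor() call, one possible sketch is to derive both arguments from the list. This assumes R >= 3.5.0, so that duplicated labels are accepted; map is a hypothetical name.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# New level name = the current values that should map to it
map <- list(Yes = c("Y", "Yes"), No = c("N", "No"))

f <- factor(x,
            levels = unlist(map, use.names = FALSE),   # "Y" "Yes" "N" "No"
            labels = rep(names(map), lengths(map)))    # "Yes" "Yes" "No" "No"
f
```

Values of x that do not appear anywhere in map, such as "H", end up as NA.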



Perhaps a named vector as a key might be of use:




> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes Yes Yes No No <NA>
Levels: No Yes


This looks very similar to your last attempt... but this one works :-)

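One caveat with the lookup-vector approach: factor() only creates levels for values that actually occur in its input. A minimal sketch of the difference, taking the levels from the key itself (key, f_dropped, and f_full are illustrative names):

```r
key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)
x <- "N"

# Without explicit levels, only the values present in x become levels
f_dropped <- factor(unname(key[x]))

# Taking the levels from the key keeps both, in the key's order
all_levels <- unique(unname(key[!is.na(key)]))
f_full <- factor(unname(key[x]), levels = all_levels)

levels(f_dropped)
levels(f_full)
```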



Another way is to make a table containing the mapping:




# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes


I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.

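Because the map is an ordinary table, it can also be written out by hand and checked before use. The following is only a sketch: the table is typed in directly with an explicit level order, and unmapped is a hypothetical helper variable.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Mapping table typed in directly, with the level order fixed explicitly
fmap <- data.frame(
  values = c("Y", "Yes", "N", "No"),
  ind    = factor(c("Yes", "Yes", "No", "No"), levels = c("Yes", "No")),
  stringsAsFactors = FALSE
)

# Values with no entry in the table; these are the ones that turn into NA
unmapped <- setdiff(x, fmap$values)
unmapped

result <- fmap$ind[match(x, fmap$values)]
result
```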






Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":




library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes


First let's note that in this specific case we can use partial matching:




x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
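Partial matching is convenient here, but it only resolves a value when exactly one candidate starts with it. A small sketch of what pmatch returns ("Yellow" is a made-up candidate used to show the ambiguous case):

```r
y <- c("Yes", "No")

# Unique partial matches resolve to an index; no match gives NA
idx <- pmatch(c("Y", "N", "H"), y, duplicates.ok = TRUE)
idx

# Two candidates share the prefix "Y", so the match is ambiguous and gives NA
ambiguous <- pmatch("Y", c("Yes", "Yellow"))
ambiguous
```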


In a more general case I'd go with dplyr::recode:




library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes Yes Yes No No <NA>
# Levels: Yes No


Slightly altered if the starting point is a factor:




x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes Yes Yes No No <NA>
# Levels: Yes No


I'm adding this answer to demonstrate the accepted answer applied to a specific factor column in a data frame, since this was not initially obvious to me (though it probably should have been).



levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
# 0 1 Z
# 7012 2507 8
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
# 0 1
# 7020 2507
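Since df is not defined above, here is a self-contained version of the same idea on a tiny made-up data frame (the six values are invented for illustration):

```r
# Toy data: "Z" should be folded into "0"
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "Z", "1")))

levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")

levels(df$var1)
table(df$var1)
```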


I don't know your real use case, but would strtrim be of any use here?



factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes Yes Yes No No <NA>
#Levels: Yes No


Similar to @Aaron's approach, but slightly simpler would be:




x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)
# [1] "H" "N" "No" "Y" "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes Yes Yes No No <NA>
# Levels: No Yes
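The hard-coded positions depend on how the levels happen to be sorted. A slightly more defensive sketch of the same idea looks the positions up by name with match() (the variable idx is my own addition):

```r
x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))

# Find the positions of the offending levels by name instead of by number
idx <- match(c("H", "N", "Y"), levels(x))
levels(x)[idx] <- c(NA, "No", "Yes")
x
```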


You may use the function below to combine/collapse multiple factor levels:



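combofactor <- function(pattern_vector,
                        replacement_vector,
                        data) {
  # Work on a copy of the level names, not on the data itself
  levels <- levels(data)
  # Rename every level equal to pattern_vector[i] to replacement_vector[i]
  for (i in seq_along(pattern_vector)) {
    levels[which(pattern_vector[i] == levels)] <-
      replacement_vector[i]
  }
  # Assigning duplicated names merges the corresponding levels
  levels(data) <- levels
  data
}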


Example:




Initialize x




x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))


Check the structure




str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...


Use the function:




x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)


Recheck the structure:




str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...

Comments

+1. More robust and, I would imagine, a lot safer than my attempt.

Thanks Aaron, I like this approach in that it at least avoids the warnings associated with droplevels(factor(x, ...)), but I remain curious about any more direct methods (e.g., if it were possible to use levels=<a named list> right in the factor(.) call).

Agree that it's odd this can't be done within factor; I don't know of a more direct way, except for using something like Ananda's solution or perhaps something with match.


This also works for ordered and the collapsed levels are ordered as they are supplied, for example a = ordered(c(1, 2, 3)); levels(a) = list("3" = 3, "1,2" = c(1, 2)) yields the ordering Levels: 3 < 1,2.


Helpful update, but the named list is friendlier to anyone who needs to read the code.

Thanks Ananda. This is a great idea, and for my applications I can probably do away with unname ... this just might take the cake.

Revisiting years later... this will drop levels that do not show up, which might not be desirable, e.g., with x="N" only the "No" level will show up in the result.


@Frank, isn't this easily resolved by adding explicit levels to the factor step?


Ah cool stuff :) Yeah, adding explicit levels works, though you'd have to type the list a second time, save the list somewhere or do some pipery or functioning like c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA) %>% { factor(unname(.[x]), levels = unique(.)) } eh.


@frank Even more cool stuff, with the added benefit that it orders the levels as in the expected output: Yes, No.

Another example: franknarf1.github.io/r-tutorial/_book/tables.html#dt-recode

