Cleaning up factor levels (collapsing multiple levels/labels)

Reposted by: bug小助手 | Updated: 2023-10-24 21:28:30

What is the most effective (i.e., efficient / appropriate) way to clean up a factor containing multiple levels that need to be collapsed? That is, how do you combine two or more factor levels into one?

Here's an example where the two levels "Yes" and "Y" should be collapsed to "Yes", and "No" and "N" collapsed to "No":

## Given: 
x <- c("Y", "Y", "Yes", "N", "No", "H")   # The 'H' should be treated as NA

## Expected output
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No  # <~~ NOTICE ONLY **TWO** LEVELS

One option is, of course, to clean the strings beforehand using sub and friends.
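
As a minimal sketch of the string-cleaning route, assuming the only valid spellings are "Y"/"Yes" and "N"/"No" (the regular expressions and the names cleaned and x.clean are illustrative, not taken from the question), it might look like this:

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")

# Normalise the short spellings; the anchors (^ and $) leave "Yes" and "No" untouched.
cleaned <- sub("^Y$", "Yes", x)
cleaned <- sub("^N$", "No", cleaned)

# Anything that is not a recognised value becomes NA.
cleaned[!cleaned %in% c("Yes", "No")] <- NA

# Explicit levels fix the level order and keep both levels even if one is absent.
x.clean <- factor(cleaned, levels = c("Yes", "No"))
x.clean
```

This keeps the cleaning step separate from the factor creation, at the cost of one sub() call per spelling.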

Another method is to allow duplicate labels and then drop them:

## Duplicate levels ==> "Warning: deprecated"
x.f <- factor(x, levels=c("Y", "Yes", "No", "N"), labels=c("Yes", "Yes", "No", "No"))

## the above line can be wrapped in either of the next two lines
factor(x.f)      
droplevels(x.f)

However, is there a more effective way?

While I know that the levels and labels arguments should be vectors, I experimented with lists, named lists, and named vectors to see what happens. Needless to say, none of the following got me any closer to my goal.

  factor(x, levels=list(c("Yes", "Y"), c("No", "N")), labels=c("Yes", "No"))
  factor(x, levels=c("Yes", "No"), labels=list(c("Yes", "Y"), c("No", "N")))

  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))
  factor(x, levels=c("Y", "Yes", "No", "N"), labels=c(Yes="Y", Yes="Yes", No="No", No="N"))
  factor(x, levels=c("Yes", "No"), labels=c(Y="Yes", Yes="Yes", No="No", N="No"))

Comments

Haven't tested this yet, but the R 3.5.0 (2018-04-23) release notes say "factor(x, levels, labels) now allows duplicated labels (not duplicated levels!). Hence you can map different values of x to the same level directly."

Recommended answers

UPDATE 2: See Uwe's answer which shows the new "tidyverse" way of doing this, which is quickly becoming the standard.

UPDATE 1: Duplicated labels (but not levels!) are now indeed allowed (per my comment above); see Tim's answer.

ORIGINAL ANSWER, BUT STILL USEFUL AND OF INTEREST:
There is a little-known option to pass a named list to the levels<- replacement function, for exactly this purpose. The names of the list should be the desired level names, and the elements should be the current level names that are to be renamed. Some (including the OP, see Ricardo's comment on Tim's answer) prefer this for ease of reading.

x <- c("Y", "Y", "Yes", "N", "No", "H", NA)
x <- factor(x)
levels(x) <- list("Yes"=c("Y", "Yes"), "No"=c("N", "No"))
x
## [1] Yes  Yes  Yes  No   No   <NA>  <NA>
## Levels: Yes No

This is described in the levels documentation (?levels); see also the examples there.

value: For the 'factor' method, a
vector of character strings with length at least the number
of levels of 'x', or a named list specifying how to rename
the levels.

This can also be done in one line, as Marek does here: https://stackoverflow.com/a/10432263/210673; the levels<- sorcery is explained here https://stackoverflow.com/a/10491881/210673.

> `levels<-`(factor(x), list(Yes=c("Y", "Yes"), No=c("N", "No")))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

As the question is titled Cleaning up factor levels (collapsing multiple levels/labels), the forcats package should be mentioned here as well, for the sake of completeness. forcats appeared on CRAN in August 2016.

There are several convenience functions available for cleaning up factor levels:

x <- c("Y", "Y", "Yes", "N", "No", "H") 

library(forcats)

Collapse factor levels into manually defined groups

fct_collapse(x, Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Change factor levels by hand

fct_recode(x, Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Automatically relabel factor levels, collapse as necessary

fun <- function(z) {
  z[z == "Y"] <- "Yes"
  z[z == "N"] <- "No"
  z[!(z %in% c("Yes", "No"))] <- NA
  z
}
fct_relabel(factor(x), fun)
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: No Yes

Note that fct_relabel() works on factor levels, so it expects a factor as its first argument. The other two functions, fct_collapse() and fct_recode(), also accept a character vector, which is an undocumented feature.

Reorder factor levels by first appearance

The expected output given by the OP is

[1] Yes  Yes  Yes  No   No   <NA>
Levels: Yes No

Here the levels are ordered as they appear in x, which differs from the default (?factor: "The levels of a factor are by default sorted").

To match the expected output, apply fct_inorder() before collapsing the levels:

fct_collapse(fct_inorder(x), Yes = c("Y", "Yes"), No = c("N", "No"), NULL = "H")
fct_recode(fct_inorder(x), Yes = "Y", Yes = "Yes", No = "N", No = "No", NULL = "H")

Both now return the expected output, with the levels in the same order.
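
For readers without forcats, a rough base R counterpart can be sketched as follows. This is an illustration rather than part of the answer above; the lookup vector map is a hypothetical helper. It relies on unique(x) returning values in order of first appearance, and on duplicated level names being merged when levels are assigned.

```r
x <- c("Y", "Y", "Yes", "N", "No", "H")
map <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No")  # hypothetical lookup

# unique(x) gives the levels in order of first appearance, much like fct_inorder().
f <- factor(x, levels = unique(x))

# Relabel the levels through the lookup; "H" is not in the map, so its level
# becomes NA and is dropped, while duplicated labels are merged.
levels(f) <- unname(map[levels(f)])
f
```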

Since R 3.5.0 (2018-04-23) you can do this in one clear and simple line:

x = c("Y", "Y", "Yes", "N", "No", "H") # The 'H' should be treated as NA

tmp = factor(x, levels= c("Y", "Yes", "N", "No"), labels= c("Yes", "Yes", "No", "No"))
tmp
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

"1 line, maps multiple values to the same level, sets NA for missing levels" – h/t @Aaron

Perhaps a named vector as a key might be of use:

> factor(unname(c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)[x]))
[1] Yes  Yes  Yes  No   No   <NA>
Levels: No Yes

This looks very similar to your last attempt... but this one works :-)

Another way is to make a table containing the mapping:

# stacking the list from Aaron's answer
fmap = stack(list(Yes = c("Y", "Yes"), No = c("N", "No")))

fmap$ind[ match(x, fmap$values) ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

# or...

library(data.table)
setDT(fmap)[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

I prefer this way, since it leaves behind an easily inspected object summarizing the map; and the data.table code looks just like any other join in that syntax.

Of course, if you don't want an object like fmap summarizing the change, it can be a "one-liner":

library(data.table)
setDT(stack(list(Yes = c("Y", "Yes"), No = c("N", "No"))))[x, on=.(values), ind ]
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

First let's note that in this specific case we can use partial matching:

x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c("Yes","No")
x <- factor(y[pmatch(x,y,duplicates.ok = TRUE)])
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

In a more general case I'd go with dplyr::recode:

library(dplyr)
x <- c("Y", "Y", "Yes", "N", "No", "H")
y <- c(Y="Yes",N="No")
x <- recode(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

Slightly altered if the starting point is a factor:

x <- factor(c("Y", "Y", "Yes", "N", "No", "H"))
y <- c(Y="Yes",N="No")
x <- recode_factor(x,!!!y)
x <- factor(x,y)
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: Yes No

I add this answer to demonstrate the accepted answer working on a specific factor in a dataframe, since this was not initially obvious to me (though it probably should have been).

levels(df$var1)
# "0" "1" "Z"
summary(df$var1)
#    0    1    Z 
# 7012 2507    8 
levels(df$var1) <- list("0"=c("Z", "0"), "1"=c("1"))
levels(df$var1)
# "0" "1"
summary(df$var1)
#    0    1 
# 7020 2507
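
Since df is not defined in the snippet above, here is a small self-contained sketch of the same idea. The data are made up for illustration; only the column name var1 follows the answer.

```r
# Hypothetical data; the column name var1 follows the answer above.
df <- data.frame(var1 = factor(c("0", "1", "Z", "0", "1", "Z", "0")))

levels(df$var1)

# Fold "Z" into "0"; "1" is listed so that it is kept rather than turned into NA.
levels(df$var1) <- list("0" = c("Z", "0"), "1" = "1")

levels(df$var1)
summary(df$var1)
```

Any level left out of the list would become NA, which is why "1" is listed explicitly.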

I don't know your real use-case, but would strtrim be of any use here...

factor( strtrim( x , 1 ) , levels = c("Y" , "N" ) , labels = c("Yes" , "No" ) )
#[1] Yes  Yes  Yes  No   No   <NA>
#Levels: Yes No

Similar to @Aaron's approach, but slightly simpler would be:

x <- c("Y", "Y", "Yes", "N", "No", "H")
x <- factor(x)
# levels(x)  
# [1] "H"   "N"   "No"  "Y"   "Yes"
# NB: the offending levels are 1, 2, & 4
levels(x)[c(1,2,4)] <- c(NA, "No", "Yes")
x
# [1] Yes  Yes  Yes  No   No   <NA>
# Levels: No Yes

You may use the function below to combine/collapse multiple factor levels:

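combofactor <- function(pattern_vector, replacement_vector, data) {
  # Work on a copy of the levels, then assign them back in one step
  lvls <- levels(data)
  for (i in seq_along(pattern_vector)) {
    # Rename every level equal to the i-th pattern
    lvls[lvls == pattern_vector[i]] <- replacement_vector[i]
  }
  # Duplicated level names are merged on assignment
  levels(data) <- lvls
  data
}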

Example:

Initialize x

x <- factor(c(rep("Y",20),rep("N",20),rep("y",20),
rep("yes",20),rep("Yes",20),rep("No",20)))

Check the structure

str(x)
# Factor w/ 6 levels "N","No","y","Y",..: 4 4 4 4 4 4 4 4 4 4 ...

Use the function:

x_new <- combofactor(c("Y","N","y","yes"),c("Yes","No","Yes","Yes"),x)

Recheck the structure:

str(x_new)
# Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...

Comments

+1 more robust and I would imagine a lot safer than my attempt.

Thanks Aaron, I like this approach in that it at least avoids the warnings associated with droplevels(factor(x, ...)), but I remain curious about any more direct methods (e.g., if it were possible to use levels=<a named list> right in the factor(.) call).

Agree that it's odd this can't be done within factor; I don't know of a more direct way, except for using something like Ananda's solution or perhaps something with match.

This also works for ordered factors, and the collapsed levels are ordered as they are supplied; for example, a = ordered(c(1, 2, 3)); levels(a) = list("3" = 3, "1,2" = c(1, 2)) yields the ordering Levels: 3 < 1,2.
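
A slightly larger sketch of the same behaviour, using made-up size categories (the names sizes, big and little are illustrative only):

```r
sizes <- ordered(c("small", "medium", "large", "huge"),
                 levels = c("small", "medium", "large", "huge"))

# The order of the list entries, not the old level order, defines the new ordering.
levels(sizes) <- list(big = c("large", "huge"), little = c("small", "medium"))

sizes
levels(sizes)
```

Because the list order wins, it is worth listing the groups from lowest to highest when the ordering matters.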

helpful update, but the named list is friendlier to anyone who needs to read the code

Thanks Ananda. This is a great idea, and for my applications I can probably do away with unname ... this just might take the cake.

Revisiting years later... this will drop levels that do not show up, which might not be desirable, e.g., with x="N" only the "No" level will show up in the result.

@Frank, isn't this easily resolved by adding explicit levels to the factor step?

Ah cool stuff :) Yeah, adding explicit levels works, though you'd have to type the list a second time, save the list somewhere or do some pipery or functioning like c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA) %>% { factor(unname(.[x]), levels = unique(.)) } eh.

@frank Even more cool stuff, with the added benefit that it orders the levels as in the expected output: Yes, No.
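
The point discussed in these comments can be sketched without a pipe. This is a base R illustration under the assumption that the lookup is stored in a variable (here called key, a hypothetical name) so that it can be reused for the levels:

```r
x <- "N"   # only one of the two values is present

key <- c(Y = "Yes", Yes = "Yes", N = "No", No = "No", H = NA)

# Without explicit levels, the unused "Yes" level is lost.
dropped <- factor(unname(key[x]))

# Deriving the levels from the key keeps both, in the order they appear in the key.
kept <- factor(unname(key[x]), levels = unique(key[!is.na(key)]))

levels(dropped)
levels(kept)
```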

Another example: franknarf1.github.io/r-tutorial/_book/tables.html#dt-recode
