
r - Fill NA values in a vector in a rolling fashion, using the last non-NA value plus values from another vector

Reposted. Author: 行者123. Updated: 2023-12-02 07:21:26

I have a data frame that is already ordered, as follows:

mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum                          <NA>      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order                          <NA>      O
6 A1  family                          <NA>      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

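What I want is to replace each NA with the last non-NA value and, in a rolling fashion, prepend the letters of all the "missing" levels. See below for what I mean.

The goal is to obtain a data frame like this: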

  ID   Level                      Taxonomy Letter
1 A1  domain                  D__Eukaryota      D
2 A1 kingdom             K__Chloroplastida      K
3 A1  phylum          P__K__Chloroplastida      P
4 A1   class            C__Mamiellophyceae      C
5 A1   order         O__C__Mamiellophyceae      O
6 A1  family      F__O__C__Mamiellophyceae      F
7 A1   genus               G__Crustomastix      G
8 A1 species S__Crustomastix sp. MBIC10709      S

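Note the last two NAs: the second one must carry over the value of the one before it. See how the first of the two starts with O__C while the second starts with F__O__C.

My best attempt so far is the following (thanks to Ajay Ohri):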

library(zoo)
mydf <- data.frame(ID="A1", Level=c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), Taxonomy=c("D__Eukaryota","K__Chloroplastida",NA,"C__Mamiellophyceae",NA,NA,"G__Crustomastix","S__Crustomastix sp. MBIC10709"), Letter=c("D","K","P","C","O","F","G","S"))
mydf <- data.frame(lapply(mydf, as.character), stringsAsFactors=FALSE)
mydf$Letter2 <- ifelse(is.na(mydf$Taxonomy),paste(mydf$Letter,'__',sep=''),"")
mydf
mydf$Taxonomy <- paste(mydf$Letter2, na.locf(mydf$Taxonomy), sep='')
mydf

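Note that I still cannot do this in a rolling fashion (for the last NA I get F__C instead of F__O__C). Any help? Thanks!

PS: If this is still confusing, let me know and I will make another MWE with more consecutive NAs, so that what I need is more obvious.

Best answer

Since the OP mentioned that memory consumption is crucial, here is a data.table approach that uses the na.locf() function from the zoo package: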

library(data.table)   # CRAN version 1.10.4 used
# coerce to data.table, convert factors to characters
DT <- data.table(mydf)[, lapply(.SD, as.character)]
# set marker for NA rows
DT[, na := is.na(Taxonomy)][]
# fill NA by Last Observation Carried Forward
DT[, Taxonomy := zoo::na.locf(Taxonomy)][]
# create list of Letters and unique row count within each group of missing taxonomies
DT[(na), `:=`(tmp = .(Letter), rn = seq_len(.N)), by = .(ID, Taxonomy)][]
# replace incomplete taxonomies
DT[(na), Taxonomy := paste(c(rev(unlist(tmp)[1:rn]), Taxonomy), collapse = "__"),
   by = .(ID, Taxonomy, rn)][]
# clean up
DT[, c("na", "tmp", "rn") := NULL][]
   ID   Level                      Taxonomy Letter
1: A1  domain                  D__Eukaryota      D
2: A1 kingdom             K__Chloroplastida      K
3: A1  phylum          P__K__Chloroplastida      P
4: A1   class            C__Mamiellophyceae      C
5: A1   order         O__C__Mamiellophyceae      O
6: A1  family      F__O__C__Mamiellophyceae      F
7: A1   genus               G__Crustomastix      G
8: A1 species S__Crustomastix sp. MBIC10709      S

I have deliberately avoided chaining the expressions so that the code can be executed step by step.
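To make the rolling rule itself easier to see, it can be sketched as a plain base R loop. This is an illustration only and not the data.table method used in this answer: fill_rolling() is a hypothetical helper name, and the sketch assumes that the rows are already in Level order, that the columns are character (not factor), and that the first Taxonomy value is not NA.

```r
# Sketch only: a base R loop, not the data.table solution from the answer.
# fill_rolling() is a hypothetical helper introduced for illustration.
fill_rolling <- function(taxonomy, letter) {
  # assumption: equal lengths, and the first value is present
  stopifnot(length(taxonomy) == length(letter), !is.na(taxonomy[1L]))
  for (i in seq_along(taxonomy)) {
    if (is.na(taxonomy[i])) {
      # prepend this row's letter to the previous (possibly already filled) value
      taxonomy[i] <- paste0(letter[i], "__", taxonomy[i - 1L])
    }
  }
  taxonomy
}

taxonomy <- c("D__Eukaryota", "K__Chloroplastida", NA, "C__Mamiellophyceae",
              NA, NA, "G__Crustomastix", "S__Crustomastix sp. MBIC10709")
letter <- c("D", "K", "P", "C", "O", "F", "G", "S")

filled <- fill_rolling(taxonomy, letter)
print(filled)
```

Because each filled value is written back before the next row is visited, the second of two consecutive NAs picks up the prefix of the first, which gives F__O__C rather than F__C. An R-level loop like this is likely to be slower than the grouped data.table version on large inputs.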

Note that data.table updates in place, without copying the whole data set, which saves both memory and time.
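One way to observe the update by reference is to compare the memory address of the table before and after a `:=` call, using data.table's address() function. The table DT_demo below is a made-up example and is not part of the original data; this is a small sketch, not a benchmark.

```r
library(data.table)

# made-up example table, for illustration only
DT_demo <- data.table(x = c("a", NA, "c"))
addr_before <- address(DT_demo)

# := adds the column by reference, so DT_demo is modified without being copied
DT_demo[, na := is.na(x)]
addr_after <- address(DT_demo)

# compare the two addresses
same_object <- identical(addr_before, addr_after)
print(same_object)
```

The same holds for the `Taxonomy := ...` and `c("na", "tmp", "rn") := NULL` steps above, which is why the helper columns can be added and dropped without duplicating the data.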

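Prerequisites and additional notes

In response to this comment, the OP has confirmed that the starting data frame is ordered and non-redundant, and that ID + Level should be a unique key of the data frame.

However, since the solution above depends on these assumptions, it is worth adding a few checks: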

# (1) ID + Level are unique keys: find duplicate Levels per ID
stopifnot(anyDuplicated(DT, by = c("ID", "Level")) == 0L)
# (2) rows missing: count rows per ID, there should be 8 Levels
DT[, .N, by = ID][, stopifnot(all(N == 8L))]
# (3) order, wrong Level names, and tests (1) and (2) as well
# create data.table with Level in proper order and a sequence number ln
levels <- data.table(
  ln = 1:8,
  Level = c("domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
)
# inner join, i.e., keep only rows with matching Level, keep order of DT
# then check for consecutively ascending level sequence numbers
levels[DT, on = "Level", nomatch = 0][, stopifnot(all(diff(ln) == 1L)), by = ID]

In addition, it must be ensured that Taxonomy is specified at least for the top Level "domain". This can be double-checked with:

# count number of rows with missing Taxonomy on top level "domain"
stopifnot(nrow(DT[Level == "domain" & is.na(Taxonomy)]) == 0L)
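One likely reason this check matters (an interpretation, not something stated in the original answer): by default, zoo::na.locf() drops leading NAs, so the filled vector becomes shorter than the column it is meant to replace. The vector x below is made up for illustration.

```r
library(zoo)

# made-up vector with a leading NA, for illustration only
x <- c(NA, "a", NA)

dropped <- na.locf(x)                 # default na.rm = TRUE drops the leading NA
kept    <- na.locf(x, na.rm = FALSE)  # keeps the leading NA in place

length(dropped)
length(kept)
```

With a missing "domain" value, the result of na.locf() would therefore no longer line up row by row with the table, unless na.rm = FALSE is passed, in which case the leading NA simply stays unfilled.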

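The grouping logic by = .(ID, Taxonomy) is used together with the selection on na, i.e., DT[(na), ...], so that the additional letters are prepended only to those taxonomies that were originally missing. While developing the solution, I introduced an extra helper column gn := rleid(ID, Taxonomy), which would cover the duplicates mentioned in this comment. In the end, I realized that I could scrap this column thanks to the prerequisites.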
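The difference between plain grouping and the rleid() helper only shows up when the same taxonomy value occurs in two separate runs. A minimal sketch with made-up values (ids and tax are not from the OP's data):

```r
library(data.table)

# made-up values: "K__X" appears in two separate runs within the same ID
ids <- c("A1", "A1", "A1", "A1")
tax <- c("K__X", "K__X", "C__Y", "K__X")

# rleid() numbers consecutive runs, so the second "K__X" run gets its own id
run_id <- rleid(ids, tax)
print(run_id)

# plain grouping by value would merge rows 1, 2 and 4 into a single group
n_value_groups <- uniqueN(data.table(ids, tax))
print(n_value_groups)
```

Under the stated prerequisites (ordered, non-redundant data with ID + Level as a unique key), the answer treats such repeated runs as not occurring, which is why grouping by (ID, Taxonomy) is used there without the helper column.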

The original question, "r - Fill NA values in a vector in a rolling fashion, using the last non-NA value plus values from another vector", can be found on Stack Overflow: https://stackoverflow.com/questions/44793668/
