
r - Split values of different lengths and bind them to columns

Reposted. Author: 行者123. Updated: 2023-12-04 18:06:48

I have a fairly large dataset (about 100k observations) that looks like this:

data <- data.frame(
  ID = seq(1, 5, 1),
  Values = c("1,2,3", "4", " ", "4,1,6,5,1,1,6", "0,0"),
  stringsAsFactors = FALSE)
data
ID Values
1 1 1,2,3
2 2 4
3 3
4 4 4,1,6,5,1,1,6
5 5 0,0

I would like to split the Values column on "," and fill the missing cells with NA:
ID v1 v2 v3 v4 v5 v6 v7
1 1 2 3 NA NA NA NA
2 4 NA NA NA NA NA NA
3 NA NA NA NA NA NA NA
4 4 1 6 5 1 1 6
5 0 0 NA NA NA NA NA
...

My best attempt so far is strsplit + rbind:
df <- data.frame(do.call(
"rbind",
strsplit(as.character(data$Values), split = "," , fixed = FALSE)
))

But rbind just recycles the values in all of the "short" rows instead of filling them with NA.
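To make the recycling concrete, here is a minimal base R illustration. The two-element input is made up for the example and is not taken from the dataset above.

```r
# Two "rows" of unequal length, as strsplit() would return them
parts <- strsplit(c("1,2,3", "4"), ",", fixed = TRUE)

# rbind() recycles the shorter vector to fill out its row
recycled <- do.call(rbind, parts)
print(recycled)
# The second row comes back as "4" "4" "4" rather than "4" NA NA
```

No warning is raised here because the longer length is a multiple of the shorter one, which makes the problem easy to miss.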
I have found a similar problem.

Many thanks, Leo

Best answer

I would suggest either looking at my cSplit function or solving the problem by hand.
The cSplit approach is simply:

library(splitstackshape)  # provides cSplit()
cSplit(data, "Values", ",")
# ID Values_1 Values_2 Values_3 Values_4 Values_5 Values_6 Values_7
# 1: 1 1 2 3 NA NA NA NA
# 2: 2 4 NA NA NA NA NA NA
# 3: 3 NA NA NA NA NA NA
# 4: 4 4 1 6 5 1 1 6
# 5: 5 0 0 NA NA NA NA NA

Solving the problem by hand looks like this:
library(data.table)  # only needed for the final data.table() call
## Split up the values
Split <- strsplit(data$Values, ",", fixed = TRUE)
## How long is each list element?
Ncol <- vapply(Split, length, 1L)
## Create an empty character matrix to store the results
M <- matrix(NA_character_, nrow = nrow(data),
            ncol = max(Ncol),
            dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
## Use matrix indexing to figure out where to put the results
M[cbind(rep(1:nrow(data), Ncol),
        sequence(Ncol))] <- unlist(Split, use.names = FALSE)
## Bind the values back together, here as a "data.table" (faster)
data.table(ID = data$ID, M)
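The matrix-indexing step above is the least obvious part, so here is a small base R sketch of how the (row, column) pairs are built. The four-element Values vector is made up for illustration, and lengths() is used as a shorthand for the vapply() call.

```r
Values <- c("1,2,3", "4", "", "0,0")
Split <- strsplit(Values, ",", fixed = TRUE)
Ncol <- lengths(Split)  # number of pieces per input row

# One (row, col) pair per split value: the row index is repeated
# once per piece, and the column index counts 1..n within each row
idx <- cbind(row = rep(seq_along(Split), Ncol),
             col = sequence(Ncol))
print(idx)

# Assigning through a two-column index matrix fills exactly those
# cells; every other cell keeps its initial NA
M <- matrix(NA_character_, nrow = length(Split), ncol = max(Ncol))
M[idx] <- unlist(Split, use.names = FALSE)
print(M)
```

The empty string contributes no index pairs at all, so its row is left entirely NA without any special handling.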

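^^ This is pretty much what happens inside cSplit, but the function also has some other options, some basic error checking and so on, which may make it a little slower than a purely manual approach (or a function written for your specific problem).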

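Both of these approaches are faster than the "data.table" + "reshape2" approach. In addition, because each row is processed individually, you should not have any problems even if you have duplicated ID values: your output should have the same number of rows as your input.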
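The row-count claim can be checked on a toy input with a repeated ID. The sketch below uses a different base R idiom from the answer's code (padding each piece with `length<-` before rbind), chosen only because it needs no packages. The three-row data frame is made up for illustration.

```r
# Toy input with a duplicated ID (1 appears twice) and one empty value
data <- data.frame(ID = c(1, 2, 1),
                   Values = c("1,2,3", "", "7,8"),
                   stringsAsFactors = FALSE)

Split <- strsplit(data$Values, ",", fixed = TRUE)
width <- max(lengths(Split))

# Pad every piece to the same length; `length<-` fills with NA
padded <- lapply(Split, `length<-`, width)
M <- do.call(rbind, padded)
colnames(M) <- paste0("V", seq_len(width))

wide <- data.frame(ID = data$ID, M, stringsAsFactors = FALSE)
print(wide)

# Rows are handled one by one, so duplicates in ID do not merge
stopifnot(nrow(wide) == nrow(data))
```

Because every vector has the same length after padding, rbind has nothing left to recycle.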

Benchmarks

I benchmarked on more rows, using data that can give "wider" results (as your comments on David's answer hinted at).

Here is the sample data:
set.seed(1)
a <- sample(0:100, 100000, TRUE)
Values <- vapply(a, function(x)
paste(sample(0:100, x, TRUE), collapse = ","), character(1L))
Values[sample(length(Values), length(Values) * .15)] <- ""
ID <- c(1:80000, 1:20000)
data <- data.frame(ID, Values, stringsAsFactors = FALSE)
DT <- as.data.table(data)

Here are the functions to test:
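fun1a <- function(inDT) {
  ## Long format: one row per split value, grouped by ID
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ","))), by = ID]
  ## Number the values within each ID to get the target column names
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}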

fun1b <- function(inDT) {
  ## Same as fun1a, but with fixed = TRUE and use.names = FALSE
  data2 <- inDT[, list(Values = unlist(
    strsplit(Values, ",", fixed = TRUE),
    use.names = FALSE)), by = ID]
  data2[, Var := paste0("v", seq_len(.N)), by = ID]
  dcast.data.table(data2, ID ~ Var,
                   fill = NA_character_,
                   value.var = "Values")
}

fun2 <- function(inDT) {
  cSplit(inDT, "Values", ",")
}

fun3 <- function(inDF) {
  Split <- strsplit(inDF$Values, ",", fixed = TRUE)
  Ncol <- vapply(Split, length, 1L)
  M <- matrix(NA_character_, nrow = nrow(inDF),
              ncol = max(Ncol),
              dimnames = list(NULL, paste0("V", sequence(max(Ncol)))))
  M[cbind(rep(1:nrow(inDF), Ncol),
          sequence(Ncol))] <- unlist(Split, use.names = FALSE)
  data.table(ID = inDF$ID, M)
}

Here are the results:
library(microbenchmark)
microbenchmark(fun2(DT), fun3(data), times = 20)
# Unit: seconds
#        expr      min       lq   median       uq      max neval
#    fun2(DT) 4.810942 5.173103 5.498279 5.622279 6.003339    20
#  fun3(data) 3.847228 3.929311 4.058728 4.160082 4.664568    20

## Didn't want to microbenchmark here...
system.time(fun1a(DT))
# user system elapsed
# 16.92 0.50 17.59
system.time(fun1b(DT)) # fixed = TRUE & use.names = FALSE
# user system elapsed
# 11.54 0.42 12.01

Note: the results of fun1a and fun1b will not be the same as the results of fun2 and fun3, because of the duplicated IDs.
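The reason for the difference is that grouping by ID pools the values from every row that shares an ID, whereas the row-wise approaches keep each input row separate. The sketch below is a base R analogue of that grouping step, using split() rather than the data.table code from the benchmark, on a made-up three-row input.

```r
ID <- c(1, 2, 1)  # ID 1 appears twice
Values <- c("1,2,3", "4", "7,8")

# Row-wise: one result per input row
per_row <- strsplit(Values, ",", fixed = TRUE)

# Grouped: rows sharing an ID are pooled before splitting
per_id <- lapply(split(Values, ID),
                 function(v) unlist(strsplit(v, ",", fixed = TRUE)))

length(per_row)  # one element per input row
length(per_id)   # one element per distinct ID
print(per_id[["1"]])  # pieces from both rows with ID 1, concatenated
```

A grouped reshape therefore returns one wide row per distinct ID, with the pieces of the duplicated rows laid end to end.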

Regarding "r - Split values of different lengths and bind them to columns", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25244684/
