R: intersecting data.frames on multiple criteria

Reposted. Author: 行者123. Last updated: 2023-12-01 17:26:44

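I am trying to populate a binary vector based on the intersection of two data.frames on multiple criteria.

My code works, but I feel it uses an excessive amount of memory just to get the binary vector.

When I apply the code to my full data (40 million+ rows), I start running into memory problems.

Is there a simpler way to produce the vector?

Here is some example data (for example, the subsample will only contain observations that are in the full sample):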

ob1_1 <- as.data.frame(cbind(c(1999),c("111","222","666","777")),stringsAsFactors=FALSE)
ob2_1 <- as.data.frame(cbind(c(2000),c("111","333","555","777")),stringsAsFactors=FALSE)
ob3_1 <- as.data.frame(cbind(c(2001),c("111","222","333","777")),stringsAsFactors=FALSE)
ob4_1 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

full_sample <- rbind(ob1_1,ob2_1,ob3_1,ob4_1)
colnames(full_sample) <- c("yr","ID")

ob1_2 <- as.data.frame(cbind(c(1999),c("111","222","777")),stringsAsFactors=FALSE)
ob2_2 <- as.data.frame(cbind(c(2000),c("333")),stringsAsFactors=FALSE)
ob3_2 <- as.data.frame(cbind(c(2001),c("888")),stringsAsFactors=FALSE)
ob4_2 <- as.data.frame(cbind(c(2002),c("111","444","555","777")),stringsAsFactors=FALSE)

sub_sample <- rbind(ob1_2,ob2_2,ob3_2,ob4_2)
colnames(sub_sample) <- c("yr","ID")

Here is my working code:

library(sqldf)

# Query 1: (yr, ID) pairs that appear in both tables
q_intersect <- ""
q_intersect <- paste(q_intersect, "select a.yr, a.ID ", sep=" ")
q_intersect <- paste(q_intersect, "from full_sample a ", sep=" ")
q_intersect <- paste(q_intersect, "intersect ", sep=" ")
q_intersect <- paste(q_intersect, "select b.yr, b.ID ", sep=" ")
q_intersect <- paste(q_intersect, "from sub_sample b ", sep=" ")
# Collapse repeated spaces; trimws() is the base R replacement for trim()
q_intersect <- trimws(gsub(" {2,}", " ", q_intersect))

# Flag every intersecting pair with 1
intersect_temp <- cbind(sqldf(q_intersect), 1)
colnames(intersect_temp) <- c("yr","ID","in_both")

# Query 2: left join the flags back onto the full sample
q_expand <- ""
q_expand <- paste(q_expand, "select in_both ", sep=" ")
q_expand <- paste(q_expand, "from full_sample a ", sep=" ")
q_expand <- paste(q_expand, "left join intersect_temp b ", sep=" ")
q_expand <- paste(q_expand, "on a.yr=b.yr ", sep=" ")
q_expand <- paste(q_expand, "and a.ID=b.ID ", sep=" ")
q_expand <- trimws(gsub(" {2,}", " ", q_expand))

# Rows without a match come back as NA; turn those into 0
solution <- as.integer(sqldf(q_expand)[,1])
solution[is.na(solution)] <- 0L

Thanks in advance for your help!

Best answer

It isn't entirely clear what you are trying to achieve, but I believe something like this would be much simpler.

library(data.table)
fullDT <- data.table(full_sample, key=c("yr", "ID"))
subDT <- data.table(sub_sample, key=c("yr", "ID"))

fullDT[ , intersect := 0L]
fullDT[subDT, intersect := 1L, nomatch=0]

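The idea is to set the key of each data.table to the columns you want to intersect on. When you call fullDT[subDT, nomatch=0], you get an inner join, and we set the new column to 1 only for those rows; rows not matched by the inner join stay at 0, as set on the previous line.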
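To make the two steps concrete, here is a minimal sketch on a cut-down version of the data. The names demo_full, demo_sub, matched and in_sub are made up for illustration and do not appear in the original post. It first shows what the keyed join returns on its own, then does the assignment.

```r
library(data.table)

# Hypothetical miniature data with the same shape as the question's tables
demo_full <- data.table(yr = c("1999", "1999", "2000", "2000"),
                        ID = c("111", "666", "111", "333"),
                        key = c("yr", "ID"))
demo_sub  <- data.table(yr = c("1999", "2000", "2001"),
                        ID = c("111", "333", "888"),
                        key = c("yr", "ID"))

# Inner join on the shared key: only (yr, ID) pairs found in both tables remain
matched <- demo_full[demo_sub, nomatch = 0]

# Start every row at 0, then overwrite the flag on the matched rows only
demo_full[, in_sub := 0L]
demo_full[demo_sub, in_sub := 1L]
```

Because := modifies demo_full by reference, no second copy of the large table is built for the flag column, which is likely where most of the memory saving over the two-query approach comes from.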

fullDT
# yr ID intersect
# 1: 1999 111 1
# 2: 1999 222 1
# 3: 1999 666 0
# 4: 1999 777 1
# 5: 2000 111 0
# 6: 2000 333 1
# 7: 2000 555 0
# 8: 2000 777 0
# 9: 2001 111 0
# 10: 2001 222 0
# 11: 2001 333 0
# 12: 2001 777 0
# 13: 2002 111 1
# 14: 2002 444 1
# 15: 2002 555 1
# 16: 2002 777 1
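
One thing to watch: setting a key sorts the table, so the flag column lines up with the rows of the original data.frame only if that data was already ordered by yr and ID, as the example happens to be. The sketch below is one way to keep the original row order, assuming a data.table version that supports the on= argument for ad hoc joins. The names full_u, sub_u, in_sub and in_both are hypothetical.

```r
library(data.table)

# Hypothetical data that is deliberately NOT sorted by (yr, ID)
full_u <- data.table(yr = c("2000", "1999", "2000"),
                     ID = c("333", "111", "111"))
sub_u  <- data.table(yr = c("1999", "2000"),
                     ID = c("111", "333"))

# Join on the named columns without setting a key, so rows are not reordered
full_u[, in_sub := 0L]
full_u[sub_u, on = c("yr", "ID"), in_sub := 1L]

# Binary vector aligned with the original row order of full_u
in_both <- full_u$in_sub
```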

Regarding "R: intersecting data.frames on multiple criteria", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/15595238/
