
r - How to join two data frames by group?

Reposted. Author: 行者123. Last updated: 2023-12-04 11:12:03

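I have a data frame (DF) that lists, for each CompanyID, the directors who worked there in 2006 and in 2007, together with two pieces of information about each of them (gender and age).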

DF <- 
CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Director_2007 Gender_2007 YearBirth_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 11734844255 M 55
25830 BANKxxx Austria AT000504 187836811559 F 45 5524344997 F NA
25830 BANKxxx Austria AT000504 5524344997 F NA 5524354997 M 39
25830 BANKxxx Austria AT000504 5524354997 M 38 5742347684 M 38
25830 BANKxxx Austria AT000504 6613115791 M 41 40160443378 M 30
12339 BANKyyy Belgium AT034003 9855321789 M 44 9855321789 M 45
12339 BANKyyy Belgium AT034003 277520199 M NA 23779351 F 34

I have a second data frame (DF2) that gives, for each DirectorID (first column), the years of experience (third column) in a given year (second column).

DF2 <- 
DirectorID Year YearsExperience
11734844255 2006 0.4
11734844255 2007 1.4
187836811559 2006 1.5
5524344997 2006 2.4
5524344997 2007 3.4
5524354997 2006 1.8
5524354997 2007 2.8
5742347684 2007 3.5
40160443378 2007 4.3
9855321789 2005 2.6
9855321789 2006 3.6
9855321789 2007 4.6
277520199 2006 1.6
23779351 2007 3.2
55443322 2005 2.5
55443322 2006 3.5

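I want to join the information from the two data frames by adding new columns that hold each director's years of experience at each company in the two years (2006 and 2007), i.e. the columns Experience_2006 and Experience_2007.

So my expected output is as follows: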

DF_final <- 
CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Experience_2006 Director_2007 Gender_2007 YearBirth_2007 Experience_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 0.4 11734844255 M 55 1.4
25830 BANKxxx Austria AT000504 187836811559 F 45 1.5 5524344997 F NA 3.4
25830 BANKxxx Austria AT000504 5524344997 F NA 2.4 5524354997 M 39 2.8
25830 BANKxxx Austria AT000504 5524354997 M 38 1.8 5742347684 M 38 3.5
25830 BANKxxx Austria AT000504 6613115791 M 41 NA 40160443378 M 30 4.3
12339 BANKyyy Belgium AT034003 9855321789 M 44 3.6 9855321789 M 45 4.6
12339 BANKyyy Belgium AT034003 277520199 M NA 1.6 23779351 F 34 3.2

Could someone please show me how to do this? Thanks.

Data

DF <- read.table(text = 
"CompanyID Name Country ISIN Director_2006 Gender_2006 YearBirth_2006 Director_2007 Gender_2007 YearBirth_2007
25830 BANKxxx Austria AT000504 11734844255 M 54 11734844255 M 55
25830 BANKxxx Austria AT000504 187836811559 F 45 5524344997 F NA
25830 BANKxxx Austria AT000504 5524344997 F NA 5524354997 M 39
25830 BANKxxx Austria AT000504 5524354997 M 38 5742347684 M 38
25830 BANKxxx Austria AT000504 6613115791 M 41 40160443378 M 30
12339 BANKyyy Belgium AT034003 9855321789 M 44 9855321789 M 45
12339 BANKyyy Belgium AT034003 277520199 M NA 23779351 F 34",
header = TRUE, stringsAsFactors = FALSE)

DF2 <- read.table(text =
"DirectorID Year YearsExperience
11734844255 2006 0.4
11734844255 2007 1.4
187836811559 2006 1.5
5524344997 2006 2.4
5524344997 2007 3.4
5524354997 2006 1.8
5524354997 2007 2.8
5742347684 2007 3.5
40160443378 2007 4.3
9855321789 2005 2.6
9855321789 2006 3.6
9855321789 2007 4.6
277520199 2006 1.6
23779351 2007 3.2
55443322 2005 2.5
55443322 2006 3.5",
header = TRUE, stringsAsFactors = FALSE)

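Best answer

For completeness, I wrote a version with dplyr and tidyr and benchmarked it against the other answers.

Update: I made another version of @Jimbou's answer, myfun4(), that does not use the filter and select functions. It is the fastest join in my benchmark. Ralf's answer (myfun2()) now comes second, and my initial version (myfun3()) comes third.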

 microbenchmark::microbenchmark(myfun1(),myfun2(),myfun3(),myfun4())
Unit: milliseconds
expr min lq mean median uq max neval
myfun1() 23.1527 28.36865 31.322275 31.53225 33.69430 52.7319 100
myfun2() 5.2549 5.78445 8.241408 8.25995 9.63870 14.4018 100
myfun3() 7.9534 10.15115 11.976498 11.40415 13.66255 20.9362 100
myfun4() 2.9676 3.40105 5.032863 4.87115 5.56065 19.0217 100
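
As a small sketch of the idea behind myfun4(): a logical row index combined with a negative column index does in one step what filter() and select() do in two. The toy data frame `exp_toy` below is hypothetical and only mirrors the three-column layout of DF2.

```r
# Hypothetical toy table with the same three columns as DF2
exp_toy <- data.frame(
  DirectorID      = c(101, 101, 202, 303),
  Year            = c(2006, 2007, 2006, 2007),
  YearsExperience = c(0.4, 1.4, 1.5, 3.5)
)

# Keep the 2006 rows and drop column 2 (Year) in a single indexing call
exp_2006 <- exp_toy[exp_toy$Year == 2006, -2]

# The same selection written with base subset(), for comparison
exp_2006_subset <- subset(exp_toy, Year == 2006, select = -Year)

print(exp_2006)
```

Dropping the column by position is short but fragile: it silently picks the wrong column if the layout of the lookup table changes, so dropping it by name may be the safer choice outside a benchmark.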

Function code:

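library(dplyr)  # provides left_join(), filter(), select() and %>%

myfun4 <- function() {
  # Base-R subsetting replaces filter()/select(); column 2 (Year) is dropped
  exp_2006 <- DF2[DF2$Year == 2006, -2]
  exp_2007 <- DF2[DF2$Year == 2007, -2]
  # Join each year's experience on the director column of the same year
  DF_final <- left_join(DF, exp_2006, by = c("Director_2006" = "DirectorID")) %>%
    left_join(exp_2007, by = c("Director_2007" = "DirectorID"))
  # The two joined columns are the last two; give them year-specific names
  n <- ncol(DF_final)
  colnames(DF_final)[(n - 1):n] <- paste0("YearsExperience_", 2006:2007)
  DF_final
}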

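myfun3 <- function() {
  # Wide lookup table: one row per director, one column per year (2005 dropped)
  DF2_spread <- tidyr::spread(DF2, Year, YearsExperience)[, -2]
  colnames(DF2_spread) <- c("DirectorID", paste0("Experience_", colnames(DF2_spread)[2:3]))
  # Join the 2006 column on Director_2006 and the 2007 column on Director_2007
  DF_final <- dplyr::left_join(DF, DF2_spread[, c("DirectorID", "Experience_2006")],
                               by = c("Director_2006" = "DirectorID"))
  dplyr::left_join(DF_final, DF2_spread[, c("DirectorID", "Experience_2007")],
                   by = c("Director_2007" = "DirectorID"))
}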

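myfun2 <- function() {
  # Reshape to long format: one row per company, director slot and year
  DF1 <- reshape(DF, direction = "long", varying = names(DF)[5:10], sep = "_", timevar = "Year")
  # Attach experience by director and year, keeping directors without a match
  DF3 <- merge(DF1, DF2, all.x = TRUE, by.x = c("Director", "Year"), by.y = c("DirectorID", "Year"))
  # Back to wide format: one set of columns per year
  DF_final <- reshape(DF3, direction = "wide", v.names = names(DF3)[c(1, 7, 8, 10)], timevar = "Year", sep = "_")
}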

myfun1 <- function() {
  DF %>%
    left_join(DF2 %>%
                filter(Year == 2006) %>%
                select(DirectorID, YearsExperience_2006 = YearsExperience),
              by = c("Director_2006" = "DirectorID")) %>%
    left_join(DF2 %>%
                filter(Year == 2007) %>%
                select(DirectorID, YearsExperience_2007 = YearsExperience),
              by = c("Director_2007" = "DirectorID"))
}
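
For comparison, the same lookup can be sketched without any join package, using base R match() on a combined director-and-year key. This is not one of the benchmarked functions. The names `df_toy`, `exp_toy` and `lookup_experience()` are hypothetical and only mirror the layout of DF and DF2, and the sketch assumes at most one row per director and year in the lookup table, since match() returns only the first hit.

```r
# Hypothetical toy data mirroring the layout of DF and DF2
df_toy <- data.frame(
  CompanyID     = c(1, 1, 2),
  Director_2006 = c(101, 202, 404),
  Director_2007 = c(101, 303, 404)
)
exp_toy <- data.frame(
  DirectorID      = c(101, 101, 202, 303, 404),
  Year            = c(2006, 2007, 2006, 2007, 2005),
  YearsExperience = c(0.4, 1.4, 1.5, 3.5, 2.6)
)

# Build a "director year" key on both sides and look it up with match();
# keys without a match yield NA, as a left join would
lookup_experience <- function(directors, year, experience) {
  key_wanted <- paste(directors, year)
  key_known  <- paste(experience$DirectorID, experience$Year)
  experience$YearsExperience[match(key_wanted, key_known)]
}

# Add one Experience_<year> column per year
for (yr in c(2006, 2007)) {
  df_toy[[paste0("Experience_", yr)]] <-
    lookup_experience(df_toy[[paste0("Director_", yr)]], yr, exp_toy)
}

print(df_toy)
```

Director 404 has a record only for 2005, so both of its experience columns come out as NA, which matches the NA shown for director 6613115791 in the expected output of the question.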

Regarding "r - How to join two data frames by group?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50470621/
