
r - How to rearrange the order of matches between two data frames

Reposted. Author: 行者123. Updated: 2023-12-03 22:35:32

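I have been working on this problem since last night and I cannot figure out how to do it.

What I want to do is match the strings in df1 against the strings in df2 and find the ones that are the same.

This is what I did: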

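# Map every delimited token in x to the number of the row it came from.
# Returns a named integer vector: names are the tokens, values are row numbers.
normalize <- function(x, delim) {
  # drop parentheses so that "(A,B)" splits into "A" and "B"
  x <- gsub(")", "", x, fixed = TRUE)
  x <- gsub("(", "", x, fixed = TRUE)
  # repeat each row index once per token (number of delimiters + 1)
  idx <- rep(seq_along(x), times = nchar(gsub(sprintf("[^%s]", delim), "", as.character(x))) + 1)
  tokens <- unlist(strsplit(as.character(x), delim))
  return(setNames(idx, tokens))
}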

# build the lookup (token -> df2 row number) from the first column of df2
lookup <- normalize(df2[, 1], ",")

# Match the tokens of one df1 column against the lookup and
# return a "df1row-df2row" pair for every token that is also in df2.
process <- function(s) {
  lookup_try <- lookup[names(s)]
  found <- which(!is.na(lookup_try))
  pos <- lookup_try[names(s)[found]]
  return(paste(s[found], pos, sep = "-"))
  # change the last line to "return(as.character(pos))" to get only the df2 row numbers
}
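As a minimal sketch of what the lookup built by normalize contains, the snippet below builds the same kind of token-to-row mapping on a made-up vector. The vector `x` and the ID `XXXXXX` are assumptions for illustration only, since the real df2 is not shown in the post.

```r
# Hypothetical toy column: each element holds comma-separated IDs, some in parentheses
x <- c("(P41182,Q9Y6Q9)", "Q92831", "P54612,O15143,Q09472")

# Strip parentheses, split on the delimiter, and record the source row of each token
clean  <- gsub("[()]", "", x)
tokens <- strsplit(clean, ",", fixed = TRUE)
lookup <- setNames(rep(seq_along(tokens), lengths(tokens)), unlist(tokens))

# Named integer vector: names are the IDs, values are the row numbers
print(lookup)

# Indexing by name gives the row for known IDs and NA for unknown ones
print(lookup[c("Q92831", "O15143", "XXXXXX")])
```

Indexing a named vector by name is what process relies on: tokens missing from the lookup come back as NA and are filtered out.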

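Then I get the result like this:
# name the list elements after the df1 columns so the result prints as $s1, $s2
res <- setNames(lapply(colnames(df1), function(x) process(normalize(df1[, x], ";"))), colnames(df1))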

This gives me the row number of each string from df1 and the row number of the matching string in df2. For this data, the output looks like this:
> res
$s1
[1] "3-4" "4-1" "5-4"

$s2
[1] "2-4" "3-15" "7-16"

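The output I want has these columns:
The first column, id, is the row number in df2 whose string matches a string in df1.
The second column, No, is the number of matches.
The third column, ID-col-n, is the row number of each matching string in df1 plus its column name.
The fourth column holds the matching strings from the first column of df1.
The fifth column holds the matching strings from the second column of df1.
And so on.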
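The first three columns described above can be sketched directly from the list shown earlier. This is only an illustration in base R: the list is copied from the printed output, and the names `pairs`, `by_df2` and `No` are chosen here, not taken from the original code.

```r
# Result list copied from the printed output of the question
res <- list(s1 = c("3-4", "4-1", "5-4"),
            s2 = c("2-4", "3-15", "7-16"))

# Split each "df1row-df2row" pair and stack all columns into one table
pairs <- do.call(rbind, lapply(names(res), function(col) {
  parts <- do.call(rbind, strsplit(res[[col]], "-", fixed = TRUE))
  data.frame(df1_col = col,
             df1_row = as.integer(parts[, 1]),
             df2_row = as.integer(parts[, 2]),
             stringsAsFactors = FALSE)
}))

# Group the df1 cells (row number + column name) by the df2 row they matched
by_df2 <- split(paste0(pairs$df1_row, pairs$df1_col), pairs$df2_row)

# Number of matches per df2 row
No <- lengths(by_df2)

print(by_df2)
print(No)
```

Here the df2 row 4 collects three df1 cells (3s1, 5s1 and 2s2), while rows 1, 15 and 16 each collect one.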

Best answer

In this case, I find it easier to reshape the data to long format and unnest it before merging it with the lookup table.

You can try:

library(tidyr)
library(dplyr)
df1_tmp <- df1
df2_tmp <- df2
# add a numeric id to df1_tmp to keep the row information
df1_tmp$id <- seq_len(nrow(df1_tmp))

# reshape to long format, then unnest cells that hold several strings
df1_tmp <- gather(df1_tmp, key = "s_val", value = "query_string", -id)
df1_tmp <- df1_tmp %>%
  mutate(query_string = strsplit(as.character(query_string), ";")) %>%
  unnest(query_string)


# strip parentheses from the df2 strings
df2_tmp$IDs. <- gsub("[()]", "", df2_tmp$IDs.)

# add a numeric id to df2_tmp to keep the row information
df2_tmp$id <- seq_along(df2_tmp$IDs.)

# unnest rows that hold several strings
df2_tmp <- df2_tmp %>%
  mutate(IDs. = strsplit(as.character(IDs.), ",")) %>%
  unnest(IDs.)

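# join on the string itself; id.x is the df1 row, id.y is the df2 row
res <- merge(df1_tmp, df2_tmp, by.x = "query_string", by.y = "IDs.")

# label each match with its df1 row number and column name, e.g. "4s1"
res$ID_col_n <- paste0(res$id.x, res$s_val)
# unique key so that spread() keeps one row per match
res$total_id <- seq_len(nrow(res))
res <- spread(res, s_val, value = query_string, fill = NA)
res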
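# summarize to get the required output
# (across() replaces the deprecated summarise_each()/funs(); needs dplyr >= 1.0.0)
res <- res %>%
  group_by(id.y) %>%
  mutate(No = n()) %>%
  group_by(id.y, No) %>%
  summarise(across(everything(), ~ paste(.x[!is.na(.x)], collapse = ",")),
            .groups = "drop") %>%
  select(-id.x, -total_id)

colnames(res)[colnames(res) == "id.y"] <- "IDs"

# number of df1 columns with at least one match (columns 1:3 are IDs, No, ID_col_n)
res$df1_colMatch_counts <- rowSums(res[, -(1:3)] != "")
# number of strings in each df2 row
df2_counts <- df2_tmp %>% group_by(id) %>% summarize(df2_string_counts = n())
res <- merge(res, df2_counts, by.x = "IDs", by.y = "id")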
res

  IDs No    ID_col_n            s1     s2 df1_colMatch_counts df2_string_counts
1   1  1         4s1        P41182                          1                 2
2   2  1         4s1        P41182                          1                 2
3   3  1         4s1        P41182                          1                 2
4   4  3 2s2,3s1,5s1 Q9Y6Q9,Q09472 Q92831                  2                 4
5  15  1         3s2               P54612                  1                 5
6  16  1         7s2               O15143                  1                 7
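For readers who want to try the idea without tidyr or dplyr, the same long-format-then-merge approach can be sketched in base R. Everything below is an assumption for illustration: the toy df1 and df2 are made up because the real data is not shown in the post, and the helper `to_long` is a hypothetical name, not part of the answer.

```r
# Hypothetical toy inputs; the real df1 and df2 are not shown in the post
df1 <- data.frame(s1 = c("A1", "B2;C3", "D4"),
                  s2 = c("C3", "E5", "A1;F6"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(IDs. = c("(A1,C3)", "E5", "(Z9,D4)"),
                  stringsAsFactors = FALSE)

# Helper (hypothetical name): one row per token, tagged with its source row number
to_long <- function(x, delim) {
  tokens <- strsplit(gsub("[()]", "", as.character(x)), delim, fixed = TRUE)
  data.frame(id = rep(seq_along(tokens), lengths(tokens)),
             string = unlist(tokens),
             stringsAsFactors = FALSE)
}

# Long form of df1: one row per (row, column, token)
df1_long <- do.call(rbind, lapply(names(df1), function(col) {
  d <- to_long(df1[[col]], ";")
  d$s_val <- col
  d
}))
df2_long <- to_long(df2$IDs., ",")

# Inner join on the token; id.x is the df1 row, id.y is the df2 row
hits <- merge(df1_long, df2_long, by = "string")
hits$ID_col_n <- paste0(hits$id.x, hits$s_val)

# One line per df2 row: the df1 cells that matched, and how many matched
out <- aggregate(ID_col_n ~ id.y, data = hits,
                 FUN = function(v) paste(sort(v), collapse = ","))
out$No <- as.vector(table(hits$id.y)[as.character(out$id.y)])
print(out)
```

This covers only the IDs, No and ID_col_n columns. The per-column strings and the two count columns of the answer would need further steps.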

Regarding "r - How to rearrange the order of matches between two data frames", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35707323/
