
r - Find the closest match between groups, then the next closest match, until a specified number of matches has been made

Reposted. Author: 行者123. Updated: 2023-12-04 12:33:33

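I want to find the closest match (smallest difference) of a variable between two groups, but if the closest match has already been taken, move on to the next closest match, until n matches have been made.

I used the code from this answer (below) to find the closest match of value between Samples for each pairwise grouping of all groups (i.e. Location by VAR).

However, there is a lot of duplication: the top match for Sample.x 1, 2 and 3 may all be Sample.y 1.

What I want is to find the next closest match for Sample.x 2, then 3, and so on, until I have made a specified number of distinct (Sample.x - Sample.y) matches. The order of Sample.x does not matter, though; I am just looking for the top n matches between Sample.x and Sample.y for a given grouping.

I tried to do this with dplyr::distinct, as shown below. But I am not sure how to filter the data frame using distinct entries for Sample.y and then filter again for the smallest DIFF. Even then, this would not necessarily result in unique Sample pairings.

Is there a clever way to do this in R with dplyr? Is there a name for this type of operation?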

    library(data.table)  # rbindlist(); attached first so dplyr's first()/last() take precedence
    library(dplyr)       # pipes, joins, grouping
    library(tidyr)       # gather()
    library(purrr)       # map()

    df01 <- data.frame(Location = rep(c("A", "C"), each = 10),
                       Sample = rep(1:10, times = 2),
                       Var1 = signif(runif(20, 55, 58), digits = 4),
                       Var2 = rep(1:10, times = 2))
    df001 <- data.frame(Location = rep("B", each = 10),
                        Sample = rep(1:10, times = 1),
                        Var1 = c(1.2, 1.3, 1.4, 1.6, 56, 110.1, 111.6, 111.7, 111.8, 120.5),
                        Var2 = c(1.5, 10.1, 10.2, 11.7, 12.5, 13.6, 14.4, 18.1, 20.9, 21.3))
    df <- rbind(df01, df001)
    # long format: one row per Location, Sample and VAR
    dfl <- df %>% gather(VAR, value, 3:4)

    df.result <- df %>%
      # get the unique elements of Location
      distinct(Location) %>%
      # pull the column as a vector
      pull() %>%
      # convert to character in case it is a factor
      as.character() %>%
      # get the pairwise combinations in a list
      combn(m = 2, simplify = FALSE) %>%
      # loop through the list with map and do the full_join
      # with the long format data dfl
      map(~ full_join(dfl %>%
                        filter(Location == first(.x)),
                      dfl %>%
                        filter(Location == last(.x)), by = "VAR") %>%
            # create a column of absolute difference
            mutate(DIFF = abs(value.x - value.y)) %>%
            # grouped by VAR, Sample.x
            group_by(VAR, Sample.x) %>%
            # apply the top_n with wt as DIFF
            # here I choose 5,
            # and then hope that this is enough to get a smaller n of final matches
            top_n(-5, DIFF) %>%
            mutate(GG = paste(Location.x, Location.y, sep = "-")))

    res1 <- rbindlist(df.result)
    # .keep_all = TRUE keeps DIFF for the next step; note that distinct() keeps
    # the first row per Sample.y, which is not necessarily the smallest DIFF
    res2 <- res1 %>% group_by(GG, VAR) %>% distinct(Sample.y, .keep_all = TRUE)
    res3 <- res2 %>% group_by(GG, VAR) %>% top_n(-2, DIFF)

Best answer

I edited the code above that produces df.result by removing the line top_n(-5, DIFF) %>%. Now res1 contains all matches between Sample.x and Sample.y.
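To make "all matches" concrete, here is a minimal base R sketch of what the table of candidate pairs looks like for a single pair of locations once the top_n step is gone. The data frame dfl_toy and its values are made up for illustration; only the column names follow the question.

```r
# Toy long-format data; column names follow the question, values are made up
dfl_toy <- data.frame(
  Location = rep(c("A", "B"), each = 2),
  Sample   = rep(1:2, times = 2),
  VAR      = "Var1",
  value    = c(55.0, 57.0, 56.0, 110.0)
)

a <- dfl_toy[dfl_toy$Location == "A", ]
b <- dfl_toy[dfl_toy$Location == "B", ]

# merge() on VAR pairs every A row with every B row that shares that VAR;
# clashing column names receive the default suffixes .x and .y
all_pairs <- merge(a, b, by = "VAR")
all_pairs$DIFF <- abs(all_pairs$value.x - all_pairs$value.y)
all_pairs$GG <- paste(all_pairs$Location.x, all_pairs$Location.y, sep = "-")
print(all_pairs)
```

With two samples per location this gives 2 x 2 = 4 candidate rows, one per (Sample.x, Sample.y) combination, which is the kind of input the matching step needs.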

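I then used res1 in the code below. It may not be perfect, but what it does is find the closest Sample.y match for the first entry of Sample.x. Those two Samples are then filtered out of the data frame. The matching is repeated until a match has been found for every unique value of Sample.y. The results may vary depending on which match is made first.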

    fun <- function(df) {
      # one pass per unique Sample.y, so every Sample.y can receive a match
      HowMany <- length(unique(df$Sample.y))
      i <- 1
      MyList_FF <- list()
      df_f <- df
      while (i <= HowMany) {
        # closest Sample.y for each remaining Sample.x, ranked by DIFF
        res1 <- df_f %>%
          group_by(GG, VAR, Sample.x) %>%
          filter(DIFF == min(DIFF)) %>%
          ungroup() %>%
          mutate(Rank1 = dense_rank(DIFF))

        # keep the single best pair of this pass (ties broken by row order)
        res2 <- res1 %>%
          group_by(GG, VAR) %>%
          filter(rank(Rank1, ties.method = "first") == 1)

        SY <- as.numeric(res2$Sample.y)
        SX <- as.numeric(res2$Sample.x)
        res3 <- df_f %>% filter(Sample.y != SY) # drop the matched Sample.y
        res4 <- res3 %>% filter(Sample.x != SX) # drop the matched Sample.x
        df_f <- res4

        MyList_FF[[i]] <- res2

        i <- i + 1
      }
      do.call("rbind", MyList_FF) # https://stackoverflow.com/a/55542822/1670053
    }

    # res1 here is the table rebuilt without the top_n() step
    df <- res1
    MyResult <- df %>%
      dplyr::group_split(GG, VAR) %>%
      map_df(fun)
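The same greedy idea, with an explicit cap of n matches as asked in the question, can be sketched in base R. This is only an illustration under stated assumptions, not the answer's own method: greedy_match is a hypothetical helper, the toy values are made up, and the input is assumed to hold the candidate pairs of a single grouping with columns Sample.x, Sample.y and DIFF.

```r
# Hypothetical helper (not from the original answer): greedy one-to-one matching.
# pairs: data frame of candidate pairs for ONE grouping (Sample.x, Sample.y, DIFF)
# n:     maximum number of matches to return
greedy_match <- function(pairs, n) {
  # Visit candidate pairs from the smallest to the largest difference
  pairs <- pairs[order(pairs$DIFF), , drop = FALSE]
  used_x <- c()
  used_y <- c()
  keep <- logical(nrow(pairs))
  for (i in seq_len(nrow(pairs))) {
    if (sum(keep) >= n) break
    # Accept a pair only if neither of its samples has been matched yet
    if (!(pairs$Sample.x[i] %in% used_x) && !(pairs$Sample.y[i] %in% used_y)) {
      keep[i] <- TRUE
      used_x <- c(used_x, pairs$Sample.x[i])
      used_y <- c(used_y, pairs$Sample.y[i])
    }
  }
  pairs[keep, , drop = FALSE]
}

# Toy data: every pairing of three x samples with three y samples
x <- c(1.0, 1.3, 5.0)
y <- c(1.1, 4.0, 9.0)
pairs <- expand.grid(Sample.x = seq_along(x), Sample.y = seq_along(y))
pairs$DIFF <- abs(x[pairs$Sample.x] - y[pairs$Sample.y])

matches <- greedy_match(pairs, n = 2)
print(matches)
```

Here Sample.x 1 and 2 are both closest to Sample.y 1; the closer one (Sample.x 1) takes it, and the second match goes to the next smallest difference whose samples are both still free (Sample.x 3 with Sample.y 2). On the data from the question, something like res1 %>% group_split(GG, VAR) %>% map_df(greedy_match, n = 2) should apply it per grouping. Because each step takes the locally best pair, the total difference is not guaranteed to be the smallest possible.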

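This post, "r - Find the closest match between groups, then the next closest match, until a specified number of matches has been made", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/55229959/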
