
r - Computing kmeans on training and test sets with dplyr and broom

Reposted. Author: 行者123. Updated: 2023-12-04 11:16:57

I am using dplyr and broom to compute kmeans on my data. My data contains test and training sets of X and Y coordinates, grouped by a parameter value (lambda, in this case):

mds.test = data.frame()
for(l in seq(0.1, 0.9, by=0.2)) {
  new.dist <- run.distance.model(x, y, lambda=l)
  mds <- preform.mds(new.dist, ndim=2)
  mds.test <- rbind(mds.test,
                    cbind(mds$space, design[,c(1,3,4,5)],
                          lambda=rep(l, nrow(mds$space)), data="test"))
}

> head(mds.test)
Comp1 Comp2 Transcripts Genes Timepoint Run lambda data
7A_0_AAGCCTAGCGAC -0.06690476 -0.25519106 68125 9324 Day 0 7A 0.1 test
7A_0_AAATGACTGGCC -0.15292848 0.04310200 28443 6746 Day 0 7A 0.1 test
7A_0_CATCTCGTTCTA -0.12529445 0.13022908 27360 6318 Day 0 7A 0.1 test
7A_0_ACCGGCACATTC -0.33015913 0.14647857 23038 5709 Day 0 7A 0.1 test
7A_0_TATGTCGGAATG -0.25826098 0.05424976 22414 5878 Day 0 7A 0.1 test
7A_0_GAAAAAGGTGAT -0.24349387 0.08071162 21907 6766 Day 0 7A 0.1 test

I've shown the head of the test dataset above, but I also have a dataset called mds.train which contains my training-data coordinates. My ultimate goal is to run k-means on both sets grouped by lambda, and then compute the within.ss, between.ss, and total.ss for the test data against the training centers. Thanks to a great resource on broom, I can run kmeans for each lambda of the test set by simply doing:
test.kclusts = mds.test %>%
  group_by(lambda) %>%
  do(kclust = kmeans(cbind(.$Comp1, .$Comp2),
                     centers = length(unique(design$Timepoint))))
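The do()/.$kclust idiom above works, but for readers on newer dplyr/tidyr the same per-lambda fit can be sketched with nest() and purrr::map(). The toy tibble here is a hypothetical stand-in for mds.test:

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

# Hypothetical stand-in for mds.test: two lambda groups of 2-D coordinates
set.seed(1)
toy <- tibble(
  lambda = rep(c(0.1, 0.3), each = 50),
  Comp1  = rnorm(100),
  Comp2  = rnorm(100)
)

toy.kclusts <- toy %>%
  group_by(lambda) %>%
  nest() %>%
  mutate(
    kclust  = map(data, ~ kmeans(cbind(.x$Comp1, .x$Comp2), centers = 3)),
    tidied  = map(kclust, tidy),    # per-cluster centers
    glanced = map(kclust, glance)   # per-fit totss / tot.withinss / betweenss
  )

toy.kclusts %>% unnest(glanced)  # one row of fit statistics per lambda
```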

And then I can compute the centers of those clusters for each lambda:
test.clusters = test.kclusts %>%
  group_by(lambda) %>%
  do(tidy(.$kclust[[1]]))

Here's where I'm stuck. How would I compute the cluster assignments as shown on the reference page (e.g. kclusts %>% group_by(k) %>% do(augment(.$kclust[[1]], points.matrix))), where my points.matrix is mds.test, which is a data.frame with length(unique(mds.test$lambda)) times as many rows as it should have? And is there a way to somehow use the centers from the training set to compute glance() statistics based on the test assignments?
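For reference, the statistics glance() reports can also be assembled by hand from a set of points and fixed centers (so the centers can come from a model fit on different data, e.g. a training set). A minimal sketch with made-up coordinates; all names here are hypothetical:

```r
set.seed(42)
pts  <- matrix(rnorm(40), ncol = 2)                       # stand-in test coordinates (20 points)
ctrs <- matrix(c(-1, -1, 1, 1), ncol = 2, byrow = TRUE)   # stand-in "training" centers (2 centers)

# distance from each point to each center, then nearest-center assignment
d <- as.matrix(dist(rbind(ctrs, pts)))[-(1:2), 1:2]       # 20 x 2 point-to-center distances
nearest <- max.col(-d)                                    # index of nearest center per point

tot.withinss <- sum((pts - ctrs[nearest, ])^2)            # within-cluster SS vs. fixed centers
totss        <- sum(scale(pts, scale = FALSE)^2)          # total SS about the grand mean
betweenss    <- totss - tot.withinss                      # same identity kmeans() uses
```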

Any help would be greatly appreciated! Thank you!

Edit: progress update. I've figured out how to aggregate the test/training assignments, but I'm still having trouble computing kmeans statistics across the two sets (training assignments against test centers, and test assignments against training centers). Updated code is below:
test.kclusts  = mds.test %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
test.clusters = test.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
test.clusterings = test.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
test.assignments = left_join(test.kclusts, mds.test) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

train.kclusts = mds.train %>% group_by(lambda) %>% do(kclust=kmeans(cbind(.$Comp1, .$Comp2), centers=length(unique(design$Timepoint))))
train.clusters = train.kclusts %>% group_by(lambda) %>% do(tidy(.$kclust[[1]]))
train.clusterings = train.kclusts %>% group_by(lambda) %>% do(glance(.$kclust[[1]]))
train.assignments = left_join(train.kclusts, mds.train) %>% group_by(lambda) %>% do(augment(.$kclust[[1]], cbind(.$Comp1, .$Comp2)))

test.assignments$data = "test"
train.assignments$data = "train"
merge.assignments = rbind(test.assignments, train.assignments)
merge.assignments %>% filter(., data=='test') %>% group_by(lambda) ... ?
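One way the stuck line could be finished is to join each test point to the training centers for its lambda and sum the squared distances to the nearest one. This is only a sketch against the objects built above; it assumes augment() named the coordinate columns X1/X2 and tidy() named the center columns x1/x2 (they may differ in your broom version, so rename accordingly):

```r
library(dplyr)

cross.stats <- test.assignments %>%
  inner_join(train.clusters, by = "lambda") %>%      # every training center per test point
  mutate(dist2 = (X1 - x1)^2 + (X2 - x2)^2) %>%
  group_by(lambda, X1, X2) %>%                       # one group per test point
  summarise(ss = min(dist2), .groups = "drop") %>%   # keep nearest training center only
  group_by(lambda) %>%
  summarise(tot.withinss = sum(ss))                  # within-SS vs. training centers
```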

I've attached a plot below illustrating my progress so far. To reiterate, I'd like to compute the kmeans statistics (within sum of squares, total sum of squares, and between sum of squares) for the test assignments/coordinates against the training-data centers (the panel where the centers are ignored):
[plot: cluster assignments for the test and training sets, per lambda]

Best Answer

One approach would be to...

  • Extract the table specifying the cluster centroids (built on the training set) via broom.
  • Compute the distance of each point in the test set to each of the cluster centroids built on the training set. This can be done with the fuzzyjoin package.
  • The cluster centroid with the shortest Euclidean distance to a test point represents that point's assigned cluster.
  • From there you can compute any metrics of interest.

  • See the example below, which uses a simpler dataset pulled from the clustering example in tidymodels.
    library(tidyverse)
    library(rsample)
    library(broom)
    library(fuzzyjoin)

    # data and train / test set-up
    set.seed(27)
    centers <- tibble(
    cluster = factor(1:3),
    num_points = c(100, 150, 50), # number points in each cluster
    x1 = c(5, 0, -3), # x1 coordinate of cluster center
    x2 = c(-1, 1, -2) # x2 coordinate of cluster center
    )

    labelled_points <-
    centers %>%
    mutate(
    x1 = map2(num_points, x1, rnorm),
    x2 = map2(num_points, x2, rnorm)
    ) %>%
    select(-num_points) %>%
    unnest(cols = c(x1, x2))

    points <-
    labelled_points %>%
    select(-cluster)

    set.seed(1234)

    split <- rsample::initial_split(points)
    train <- rsample::training(split)
    test <- rsample::testing(split)

    # Fit kmeans on train then assign clusters to test
    kclust <- kmeans(train, centers = 3)

    clust_centers <- kclust %>%
    tidy() %>%
    select(-c(size, withinss))

    test_clusts <- fuzzyjoin::distance_join(mutate(test, index = row_number()),
    clust_centers,
    max_dist = Inf,
    method = "euclidean",
    distance_col = "dist") %>%
    group_by(index) %>%
    filter(dist == min(dist)) %>%
    ungroup()
    #> Joining by: c("x1", "x2")

    # resulting table
    test_clusts
    #> # A tibble: 75 x 7
    #> x1.x x2.x index x1.y x2.y cluster dist
    #> <dbl> <dbl> <int> <dbl> <dbl> <fct> <dbl>
    #> 1 4.24 -0.946 1 5.07 -1.10 3 0.847
    #> 2 3.54 0.287 2 5.07 -1.10 3 2.06
    #> 3 3.71 -1.67 3 5.07 -1.10 3 1.47
    #> 4 5.03 -0.788 4 5.07 -1.10 3 0.317
    #> 5 6.57 -2.49 5 5.07 -1.10 3 2.04
    #> 6 4.97 0.233 6 5.07 -1.10 3 1.34
    #> 7 4.43 -1.89 7 5.07 -1.10 3 1.01
    #> 8 5.34 -0.0705 8 5.07 -1.10 3 1.07
    #> 9 4.60 0.196 9 5.07 -1.10 3 1.38
    #> 10 5.68 -1.55 10 5.07 -1.10 3 0.758
    #> # ... with 65 more rows

    # calc within clusts SS on test
    test_clusts %>%
    group_by(cluster) %>%
    summarise(size = n(),
    withinss = sum(dist^2),
    withinss_avg = withinss / size)
    #> # A tibble: 3 x 4
    #> cluster size withinss withinss_avg
    #> <fct> <int> <dbl> <dbl>
    #> 1 1 11 32.7 2.97
    #> 2 2 35 78.9 2.26
    #> 3 3 29 62.0 2.14

    # compare to on train
    tidy(kclust) %>%
    mutate(withinss_avg = withinss / size)
    #> # A tibble: 3 x 6
    #> x1 x2 size withinss cluster withinss_avg
    #> <dbl> <dbl> <int> <dbl> <fct> <dbl>
    #> 1 -3.22 -1.91 40 76.8 1 1.92
    #> 2 0.0993 1.06 113 220. 2 1.95
    #> 3 5.07 -1.10 72 182. 3 2.53

    # plot of test and train points
    test_clusts %>%
    select(x1 = x1.x, x2 = x2.x, cluster) %>%
    mutate(type = "test") %>%
    bind_rows(
    augment(kclust, train) %>%
    mutate(type = "train") %>%
    rename(cluster = .cluster)
    ) %>%
    ggplot(aes(x = x1,
    y = x2,
    color = as.factor(cluster)))+
    geom_point()+
    facet_wrap(~fct_rev(as.factor(type)))+
    coord_fixed()+
    labs(title = "Cluster Assignment on Training and Holdout Datasets",
    color = "Cluster")+
    theme_bw()

    Created on 2021-08-19 by the reprex package (v2.0.0)
    (For a link to a conversation about making this simpler in tidymodels, see the comments on the OP.)
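The reprex stops at the within-cluster SS; the between and total SS asked about in the question follow from the same table. A sketch reusing the test_clusts object built above (x1.x/x2.x are the test coordinates, dist is the distance to the assigned training center):

```r
library(dplyr)

test_clusts %>%
  summarise(
    totss        = sum((x1.x - mean(x1.x))^2 + (x2.x - mean(x2.x))^2),
    tot.withinss = sum(dist^2),
    betweenss    = totss - tot.withinss   # same identity kmeans() reports
  )
```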

    For "r - Computing kmeans on training and test sets with dplyr and broom", see the similar question on Stack Overflow: https://stackoverflow.com/questions/40099628/
