
r - Select one row per group based on conditions?

Reposted. Author: 行者123. Updated: 2023-12-04 08:42:29

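I have a dataset with one gene per row, where the genes are also assigned to groups. I want to select 1 gene per group into a new data frame, based on these conditions:

  • If the top-scoring gene's score differs from the others in the group by >0.02, select the top-scoring gene
  • If the score difference between genes in the group is <0.02, select the gene with the higher direct_count
  • If direct_count is the same, select the gene with the highest secondary_count
  • If everything is identical, select both genes.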

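    I have been trying to adapt similar questions on this site, but with this many conditions I cannot get the other examples to work with my code.
    My data looks like this: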
    Group  Gene    Score      direct_count  secondary_count
    1      AQP11   0.5566507  4             5
    1      CLNS1A  0.2811747  0             2
    1      RSF1    0.5469924  3             6
    2      CFDP1   0.4186066  1             2
    2      CHST6   0.4295135  1             3
    3      ACE     0.634      1             1
    3      NOS2    0.6345     1             1
    4      Gene1   0.1        10            20
    4      Gene2   0.68       3             1
    4      Gene3   0.7        0             1
    Expected output, with the gene(s) selected per group:
    Group  Gene   Score      direct_count  secondary_count
    1      AQP11  0.5566507  4             5   #highest direct_count
    2      CHST6  0.4295135  1             3   #highest secondary_count after matching direct_count
    3      ACE    0.634      1             1   #ACE and NOS2 have matching counts
    3      NOS2   0.6345     1             1
    I am currently trying to do this with dplyr::group_by() and if statements.
    Input data:
    structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L), Gene = c("AQP11", 
    "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2", "Gene1", "Gene2", "Gene3"), Score = c(0.5566507,
    0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345, 0.1, 0.68, 0.7), direct_count = c(4L,
    0L, 3L, 1L, 1L, 1L, 1L, 10L, 3L, 0L ), secondary_count = c(5L, 2L, 6L, 2L,
    3L, 1L, 1L, 20L, 1L, 1L)), row.names = c(NA, -10L), class = c("data.table",
    "data.frame"))
    Edit:
    Including sessionInfo. I also want to note that in my real data some rows have NA for direct_count and secondary_count.
    > sessionInfo()
    R version 4.0.2 (2020-06-22)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 18362)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United Kingdom.1252
    [2] LC_CTYPE=English_United Kingdom.1252
    [3] LC_MONETARY=English_United Kingdom.1252
    [4] LC_NUMERIC=C
    [5] LC_TIME=English_United Kingdom.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] forcats_0.5.0 stringr_1.4.0 purrr_0.3.4 readr_1.4.0
    [5] tibble_3.0.4 ggplot2_3.3.2 tidyverse_1.3.0 tidyr_1.1.2
    [9] dplyr_1.0.2 data.table_1.13.2

    loaded via a namespace (and not attached):
    [1] Rcpp_1.0.5 cellranger_1.1.0 pillar_1.4.6 compiler_4.0.2
    [5] dbplyr_1.4.4 tools_4.0.2 jsonlite_1.7.1 lubridate_1.7.9
    [9] lifecycle_0.2.0 gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.8
    [13] reprex_0.3.0 cli_2.1.0 DBI_1.1.0 rstudioapi_0.11
    [17] haven_2.3.1 withr_2.3.0 xml2_1.3.2 httr_1.4.2
    [21] fs_1.5.0 generics_0.0.2 vctrs_0.3.4 gtools_3.8.2
    [25] hms_0.5.3 grid_4.0.2 tidyselect_1.1.0 glue_1.4.1
    [29] R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.8
    [33] blob_1.2.1 magrittr_1.5 backports_1.1.10 scales_1.1.1
    [37] ellipsis_0.3.1 rvest_0.3.6 assertthat_0.2.1 colorspace_1.4-1
    [41] stringi_1.5.3 munsell_0.5.0 broom_0.7.2 crayon_1.3.4
    Edit, selection problem with the real data:
    structure(list(Group = c(2L, 2L, 2L, 2L, 2L), Gene = c("CFDP1", 
    "CHST6", "RNU6-758P", "Gene1", "TMEM170A"), Score = c(0.551740109920502,
    0.598918557167053, 0.564491391181946, 0.567291617393494, 0.616708278656006
    ), direct_count = c(1, 1, 0, 0, 0), secondary_count = c(62,
    6, 1, 1, 2)), row.names = c(NA, -5L), class = c("data.table",
    "data.frame"))
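    From this group, Gene1 is being selected when it should actually be CHST6, and I cannot find the reason.
    The data looks like this:

       Group  Gene       Score      direct_count  secondary_count
    1  2      CFDP1      0.5517401  1             62
    2  2      CHST6      0.5989186  1             6
    3  2      RNU6-758P  0.5644914  0             1
    4  2      Gene1      0.5672916  0             1
    5  2      TMEM170A   0.6167083  0             2
    CHST6 has the highest direct_count of all the genes that are within <0.05 of the top-scoring gene in the group, yet Gene1 is selected.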
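The claim above can be checked by computing each gene's gap to the top score in its group. The following is only a sketch: it rebuilds the Group 2 excerpt as a plain data frame, takes the 0.05 threshold from the sentence above, and uses `gap_to_top` and `within_window` as hypothetical helper column names that do not appear in the original post.

```r
library(dplyr)

# Real-data excerpt from the question (Group 2 only), rebuilt as a plain data frame.
real_df <- data.frame(
  Group = 2L,
  Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
  Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946,
            0.567291617393494, 0.616708278656006),
  direct_count = c(1, 1, 0, 0, 0),
  secondary_count = c(62, 6, 1, 1, 2)
)

# gap_to_top and within_window are hypothetical helper columns for inspection only.
gaps <- real_df %>%
  group_by(Group) %>%
  mutate(
    gap_to_top = max(Score) - Score,   # distance from the group's best score
    within_window = gap_to_top < 0.05  # threshold quoted in the question
  ) %>%
  ungroup()

print(as.data.frame(gaps))
```

With these values, CHST6, Gene1 and TMEM170A fall inside the 0.05 window, and CHST6 is the only one of the three with a direct_count of 1.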

    Best answer

    group_by and filter from the tidyverse are your friends.

    library(dplyr)

    # Example data: groups 1 to 3 from the question
    df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
        Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
        Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
        direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
        secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)),
        row.names = c(NA, -7L), class = c("data.table", "data.frame"))

    new_df <- df %>%
        group_by(Group) %>%
        # score spread within each group (highest score minus lowest score)
        mutate(max_score_difference = abs(max(Score) - min(Score))) %>%
        # first condition: spread > 0.02 keeps only the top-scoring gene,
        # spread < 0.02 keeps every gene for the tie-breaks below
        filter((max_score_difference > 0.02 & Score == max(Score)) | max_score_difference < 0.02) %>%
        # second condition: among close scores, keep the highest direct_count
        filter(max_score_difference > 0.02 | (max_score_difference < 0.02 & direct_count == max(direct_count))) %>%
        # third condition: if still tied, keep the highest secondary_count
        filter(max_score_difference > 0.02 | (max_score_difference < 0.02 & secondary_count == max(secondary_count))) %>%
        ungroup() %>%
        # fourth condition is met by the max() comparisons above: full ties keep all tied rows
        select(-max_score_difference) %>%
        data.frame()

    print(new_df)
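The filters above compare each group's overall score spread (highest minus lowest), so a single low-scoring gene can push a whole group into the "> 0.02" branch. The sketch below is one possible variant, not the answerer's method. It measures every gene's gap to the group's top score instead, and it handles the NA counts mentioned in the question by treating them as 0, which is an assumption. The names `select_genes`, `window`, `direct_cmp` and `secondary_cmp` are hypothetical.

```r
library(dplyr)

# Hypothetical helper: keeps one gene per group, or all tied genes on a full tie.
# `window` is the score gap treated as "close to the top score".
select_genes <- function(data, window = 0.02) {
  data %>%
    group_by(Group) %>%
    # Assumption: a missing count is treated as 0, so it never wins a tie-break.
    mutate(
      direct_cmp = coalesce(as.numeric(direct_count), 0),
      secondary_cmp = coalesce(as.numeric(secondary_count), 0)
    ) %>%
    # Keep only genes whose score is within `window` of the group's best score.
    filter(max(Score) - Score < window) %>%
    # Tie-break 1: highest direct_count among the remaining genes.
    filter(direct_cmp == max(direct_cmp)) %>%
    # Tie-break 2: highest secondary_count among the remaining genes.
    filter(secondary_cmp == max(secondary_cmp)) %>%
    ungroup() %>%
    select(-direct_cmp, -secondary_cmp) %>%
    as.data.frame()
}

# Groups 1 to 3, using the same values as the answer's example data.
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5269924, 0.4186066, 0.4295135, 0.634, 0.6345),
  direct_count = c(4L, 0L, 3L, 1L, 1L, 1L, 1L),
  secondary_count = c(5L, 2L, 6L, 2L, 3L, 1L, 1L)
)

# Real-data excerpt from the question's edit (Group 2 only).
real_df <- data.frame(
  Group = 2L,
  Gene = c("CFDP1", "CHST6", "RNU6-758P", "Gene1", "TMEM170A"),
  Score = c(0.551740109920502, 0.598918557167053, 0.564491391181946,
            0.567291617393494, 0.616708278656006),
  direct_count = c(1, 1, 0, 0, 0),
  secondary_count = c(62, 6, 1, 1, 2)
)

result <- select_genes(df)
print(result)  # Gene column: AQP11, CHST6, ACE, NOS2

real_result <- select_genes(real_df, window = 0.05)
print(real_result)  # Gene column: CHST6
```

Because each filter is evaluated per group on the rows that survived the previous step, the tie-breaks only ever compare genes that are already close to the top score. Scores that sit almost exactly `window` apart are subject to floating-point rounding, so the boundary case may need an explicit tolerance.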

    Regarding "r - Select one row per group based on conditions?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/64488470/
