% tidy-6ren">
gpt4 book ai didi

r - 如何在R数据帧中获得独占计数

转载 作者:行者123 更新时间:2023-12-03 23:26:16 26 4
gpt4 key购买 nike

我在 R 中有下面提到的数据框:

DF <- tibble::tribble(
~ID, ~Check,
"I-1", "A1",
"I-2", "A2",
"I-2", "OT",
"I-2", "LP",
"I-3", "A1",
"I-3", "A2",
"I-4", NA,
"I-5", NA,
"I-6", "A1",
"I-6", "OT",
"I-7", "A2"
)

DF2 <- tibble::tribble(
~ID, ~Remarks,
"I-1", "{X1,XR,XT}",
"I-2", "{X2,XR}",
"I-3", NA,
"I-4", "{X1,XR,X2}",
"I-5", "{X1}",
"I-6", "{XT}",
"I-7", "{X1,X2}"
)
使用上面提到的两个数据框,我需要以下格式的输出:
我想确定 Check的唯一计数的地方和 Remark对于每个唯一 ID 和每个组合 Check与另一个 CheckRemark 类似.
注意 - 行的顺序应基于 Exclusive_Count 从高到低的 Check .很有可能是唯一的数量 CheckRemark在我的实际数据框中可能会有所不同。 (即 10 个独特的 Remark 和 5 个 Check ,类似这样)
DF_输出<-
Remark   Exclusive_Count  %       X1  X2  XR  XT  Check  Exclusive_Count  %          A1  A2  OT  LP
Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
X1 1 33.33% 0 2 2 1 A1 1 25.00 0 1 1 0
X2 0 0.00% 2 0 1 0 A2 1 25.00% 1 0 1 1
XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 0 1
XT 1 33.33% 1 0 1 0 LP 0 0.00% 0 1 1 0
Total 3 100.00% 5 4 4 2 Total 4 100.00% 2 3 3 2

最佳答案

OP 已请求规范答案。所以,我创建了一个函数 get_exclusive_counts()它采用任何 tibble、data.frame 或 data.table 的前两列,其中第一列包含 ID,第二列包含有效负载,例如,Check , 长格式。
该函数与列名称无关,并且可以处理有效负载列中任意数量的不同项目。它为每个输入小标题返回一个 data.table:

get_exclusive_counts(DF)
    Check Exclusive_Count       % A1 A2 LP OT
1: Blank 2 50.00% 0 0 0 0
2: A1 1 25.00% 0 1 0 1
3: A2 1 25.00% 1 0 1 1
4: LP 0 0.00% 0 1 0 1
5: OT 0 0.00% 1 1 1 0
6: Totals 4 100.00% 2 3 2 3

对于第二个用例 DF2 ,需要事先将有效负载拆分为单独的行:
library(magrittr)
DF2 %>%
dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
tidyr::separate_rows(Remarks) %>%
get_exclusive_counts()
   Remarks Exclusive_Count       % X1 X2 XR XT
1: Blank 1 33.33% 0 0 0 0
2: X1 1 33.33% 0 2 2 1
3: XT 1 33.33% 1 0 1 0
4: X2 0 0.00% 2 0 2 0
5: XR 0 0.00% 2 2 0 1
6: Totals 3 100.00% 5 4 5 2

请注意,结果表的第一列的名称已从输入 data.frame 中保留。
OP 提到了 Remarks 的数量和 Check可能不同。因此, cbind() 真的没有意义。两个结果表,因为这只会在行数相同的情况下给出合理的结果。
此外,OP 的预期结果有一些列名重复(至少 Exclusive_Count% ,也许更多),这表明结果可能不用于进一步处理,而仅用于显示/打印。
并排打印结果
但是,我创建了一个函数 get_exclusive_counts_side_by_side()打印调用 get_exclusive_counts() 的结果
  • 对于任意数量的输入数据集,
  • 具有不同的行数,以及
  • 与最后几行 ( Totals ) 对齐。

  • 该函数返回一个带有字符列的 data.table。
    下面的调用将重现 OP 的预期结果:
    get_exclusive_counts_side_by_side(
    DF2 %>%
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
    tidyr::separate_rows(Remarks),
    DF)
       Remarks Exclusive_Count       % X1 X2 XR XT  Check Exclusive_Count       % A1 A2 LP OT
    1: Blank 1 33.33% 0 0 0 0 Blank 2 50.00% 0 0 0 0
    2: X1 1 33.33% 0 2 2 1 A1 1 25.00% 0 1 0 1
    3: XT 1 33.33% 1 0 1 0 A2 1 25.00% 1 0 1 1
    4: X2 0 0.00% 2 0 2 0 LP 0 0.00% 0 1 0 1
    5: XR 0 0.00% 2 2 0 1 OT 0 0.00% 1 1 1 0
    6: Totals 3 100.00% 5 4 5 2 Totals 4 100.00% 2 3 2 3

    这是另一个用例,用于证明它可以处理不同的行和任意数量的输入数据集:
    get_exclusive_counts_side_by_side(
    DF,
    DF3 %>%
    dplyr::mutate(Remarks = stringr::str_remove_all(Remarks, "[{}]")) %>%
    tidyr::separate_rows(Remarks),
    DF)
        Check Exclusive_Count       % A1 A2 LP OT Remarks Exclusive_Count       % X1 X2 XR XT Y2 Y3 Y4  Check Exclusive_Count       % A1 A2 LP OT
    1: Blank 2 50.00% 0 0 0 0 X1 2 50.00% 0 2 2 1 1 1 0 Blank 2 50.00% 0 0 0 0
    2: A1 1 25.00% 0 1 0 1 Blank 1 25.00% 0 0 0 0 0 0 0 A1 1 25.00% 0 1 0 1
    3: A2 1 25.00% 1 0 1 1 XT 1 25.00% 1 0 1 0 0 0 0 A2 1 25.00% 1 0 1 1
    4: LP 0 0.00% 0 1 0 1 X2 0 0.00% 2 0 2 0 0 0 0 LP 0 0.00% 0 1 0 1
    5: OT 0 0.00% 1 1 1 0 XR 0 0.00% 2 2 0 1 0 0 0 OT 0 0.00% 1 1 1 0
    6: Y2 0 0.00% 1 0 0 0 0 1 1
    7: Y3 0 0.00% 1 0 0 0 1 0 0
    8: Y4 0 0.00% 0 0 0 0 1 0 0
    9: Totals 4 100.00% 2 3 2 3 Totals 4 100.00% 7 4 5 2 3 2 1 Totals 4 100.00% 2 3 2 3

    功能定义
    代码看起来相当庞大,但有一半的行是注释。因此,代码应该是不言自明的。
    此外,大约一半的代码行是由于 OP 的附加要求,例如 % 列或 Totals 行。
    get_exclusive_counts <- function(DF) {
    library(data.table)
    library(magrittr)
    # make copy of first 2 cols to preserve original attributes of DF
    DT <- as.data.table(DF[, 1:2])
    # retain original column names
    old <- colnames(DT)[1:2]
    # rename colnames in copy for convenience of programming
    setnames(DT, c("id", "val")) # col 1 contains id, col 2 contains payload
    # aggregate by id to find exclusive counts = ids with only one element
    tmp <- DT[, .N, keyby = id][N == 1L]
    # create table of exclusive counts by joining and aggregating
    excl <- DT[tmp, on = .(id)][, .(Exclusive_Count = .N), keyby = val] %>%
    # append column of proportions, will be formatted after computing Totals
    .[, `%` := Exclusive_Count / sum(Exclusive_Count)]
    # anti-join to find remaining rows
    rem <- DT[!tmp, on = .(id)]
    # create co-occurrence matrix in long format by a self-join
    coocc <- rem[rem, on = .(id), allow.cartesian = TRUE] %>%
    # reshape to wide format and compute counts of co-occurrences w/o diagonals
    dcast(val ~ i.val, length, subset = .(val != i.val))
    # build final result table by merging both subresults
    merge(excl, coocc, by = "val", all = TRUE) %>%
    # replace NA counts by 0
    .[, lapply(.SD, nafill, fill = 0L), by = val] %>%
    # clean-up: order by decreasing Exclusive_Counts %>%
    .[order(-Exclusive_Count)] %>%
    # append Totals row
    rbind(., .[, c(.(val = "Totals"), lapply(.SD, sum)), .SDcols = is.numeric]) %>%
    # clean-up: format proportion as percentage
    .[, `%` := sprintf("%3.2f%%", 100 * `%`)] %>%
    # clean-up: Replace <NA> by "Blank" in val column
    .[is.na(val), val := "Blank"] %>%
    # rename val column
    setnames("val", old[2]) %>%
    # return result visibly
    .[]
    }
    这是 get_exclusive_counts_side_by_side()的代码:
    get_exclusive_counts_side_by_side <- function(...) {
    library(data.table)
    library(magrittr)
    # process input, return list of subresults
    ec_list<- list(...) %>%
    lapply(get_exclusive_counts)
    # create row indices for maximum rows
    rid <- ec_list %>%
    lapply(nrow) %>%
    Reduce(max, .) %>%
    {data.table(.rowid = 1:.)}
    # combine subresults
    ec_list %>%
    # insert empty rows if necessary
    lapply(function(.x) .x[
    , .rowid := .I][
    # but align last row
    .rowid == .N, .rowid := nrow(rid)][
    rid, on =.(.rowid)][
    , .rowid := NULL]
    ) %>%
    # all data.tables have the same number of rows, now cbind()
    do.call(cbind, .) %>%
    # replace all NA by empty character strings
    .[, lapply(.SD, . %>% as.character %>% fifelse(is.na(.), "", .))]
    }
    补充说明
    如果我理解正确,独占计数是指仅分配了一项(或 NA)的 ID。这是相当直接的计算
  • 计算每个 ID 的项目数,
  • 仅选择一项的 ID,
  • 选择输入 data.frame 中属于这些 ID 的行(使用连接),和
  • 计算独占行子集中项目的出现次数。

  • 此外,该功能处理 OP 的附加要求,这些要求超出了排他计数的识别:
  • 添加剩余非排他性的共现计数矩阵
    行,
  • 在特定位置添加一列排他计数的比例并将其格式化为百分比,
  • 添加总计行,
  • 更换 NA s 由零或 "Blank" , 分别

  • 数据
    DF <- tibble::tribble(
    ~ID, ~Check,
    "I-1", "A1",
    "I-2", "A2",
    "I-2", "OT",
    "I-2", "LP",
    "I-3", "A1",
    "I-3", "A2",
    "I-4", NA,
    "I-5", NA,
    "I-6", "A1",
    "I-6", "OT",
    "I-7", "A2"
    )

    DF2 <- tibble::tribble(
    ~ID, ~Remarks,
    "I-1", "{X1,XR,XT}",
    "I-2", "{X2,XR}",
    "I-3", NA,
    "I-4", "{X1,XR,X2}",
    "I-5", "{X1}",
    "I-6", "{XT}",
    "I-7", "{X1,X2}"
    )

    DF3 <- tibble::tribble(
    ~ID, ~Remarks,
    "I-1", "{X1,XR,XT}",
    "I-2", "{X2,XR}",
    "I-3", NA,
    "I-4", "{X1,XR,X2}",
    "I-5", "{X1}",
    "I-6", "{XT}",
    "I-7", "{X1,X2}",
    "I-8", "{X1,Y2,Y3}",
    "I-9", "{Y2,Y4}",
    "I10", "{X1}",
    )

    关于r - 如何在R数据帧中获得独占计数,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/66016867/

    26 4 0
    Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
    广告合作:1813099741@qq.com 6ren.com