
r - Using SparkR to find the variables that form a primary key

Reposted. Author: 行者123. Updated: 2023-12-04 11:44:19

Here is my toy data:

df <- tibble::tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
  "A", "C", 1L, 5L, "AA", "AB", 1L,
  "A", "C", 2L, 5L, "BB", "AC", 2L,
  "A", "D", 1L, 7L, "AA", "BC", 2L,
  "A", "D", 2L, 3L, "BB", "CC", 1L,
  "B", "C", 1L, 8L, "AA", "AB", 1L,
  "B", "C", 2L, 6L, "BB", "AC", 2L,
  "B", "D", 1L, 9L, "AA", "BC", 2L,
  "B", "D", 2L, 6L, "BB", "CC", 1L)
My original question, at https://stackoverflow.com/a/53110342/6762788, was: how can I get the combination of the smallest number of variables that uniquely identifies the observations in a data frame, i.e. which variables together can form a primary key? The answer/code below works very well; many thanks to thelatemail.
nms <- unlist(lapply(seq_len(length(df)), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)
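To make explicit how a primary key is read off from `out`: a combination is a candidate key when its distinct-row count equals `nrow(df)`, and the minimal keys are the candidates using the fewest variables. A self-contained sketch on the toy data (the helper names `keys` and `minimal_keys` are my own):

```r
library(tibble)

df <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5, ~var6, ~var7,
  "A", "C", 1L, 5L, "AA", "AB", 1L,
  "A", "C", 2L, 5L, "BB", "AC", 2L,
  "A", "D", 1L, 7L, "AA", "BC", 2L,
  "A", "D", 2L, 3L, "BB", "CC", 1L,
  "B", "C", 1L, 8L, "AA", "AB", 1L,
  "B", "C", 2L, 6L, "BB", "AC", 2L,
  "B", "D", 1L, 9L, "AA", "BC", 2L,
  "B", "D", 2L, 6L, "BB", "CC", 1L)

# All column combinations and their distinct-row counts (as in the answer above)
nms <- unlist(lapply(seq_len(length(df)), combn, x = names(df), simplify = FALSE),
              recursive = FALSE)
out <- data.frame(
  vars = vapply(nms, paste, collapse = ",", FUN.VALUE = character(1)),
  counts = vapply(nms, function(x) nrow(unique(df[x])), FUN.VALUE = numeric(1))
)

# Candidate keys: combinations whose distinct-row count equals the row count
keys <- out[out$counts == nrow(df), ]
# Minimal keys: candidates that use the fewest variables
n_vars <- lengths(strsplit(as.character(keys$vars), ","))
minimal_keys <- keys[n_vars == min(n_vars), ]
minimal_keys$vars
```

On this toy data no single variable is a key, but the pair var1,var6 for example uniquely identifies all eight rows.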
Now, to make this work on big data, I want to port it to SparkR. Building on that answer, how can I translate this code into SparkR? If that is hard in SparkR, I could use sparklyr instead.

Best answer

I broke the problem above into small pieces and tried the SparkR code below. However, the "counts <- lapply(nms, ..." line seems slow. Starting from this code, can you suggest further performance improvements, perhaps by updating the "counts <- lapply(nms, ..." line?

library(SparkR); library(tidyverse)

df_spark <- mtcars %>% as.DataFrame()

# Sizes of the variable combinations to try: 1 .. ncol
num_m <- seq_len(ncol(df_spark))

nam_list <- SparkR::colnames(df_spark)

# All combinations of column names of size m
combinations <- function(m) {
  combn(nam_list, m, simplify = FALSE)
}

nms <- spark.lapply(num_m, combinations) %>% unlist(recursive = FALSE)

vars <- map_chr(nms, ~ paste(.x, collapse = ","))

# One distinct-count Spark job per combination (the slow part)
counts <- lapply(nms, function(x) {
  df_spark %>% SparkR::select(x) %>% SparkR::distinct() %>% SparkR::count()
}) %>% unlist()

out <- data.frame(
  vars = vars,
  counts = counts
)

primarykeys <- out %>%
  dplyr::mutate(n_vars = str_count(vars, ",") + 1) %>%
  dplyr::filter(counts == SparkR::count(df_spark)) %>% # was nrow(df); df is not defined here
  dplyr::filter(n_vars == min(n_vars))

primarykeys
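On the slow "counts <- lapply(nms, ..." line: each iteration launches its own Spark job (select, distinct, count). One possible alternative, sketched here as an untested assumption rather than a benchmarked fix, is to build one `SparkR::countDistinct` expression per combination and evaluate all of them in a single `agg()` pass, so Spark scans the data once. Note that the number of combinations is 2^ncol - 1, so the whole approach (in either form) is only feasible for modestly wide tables, and SQL-style distinct counting skips rows containing NULLs in the selected columns:

```r
library(SparkR)

sparkR.session()
df_spark <- as.DataFrame(mtcars)
nam_list <- SparkR::colnames(df_spark)

# All column-name combinations, as in the answer above
nms <- unlist(lapply(seq_along(nam_list), combn, x = nam_list, simplify = FALSE),
              recursive = FALSE)

# One countDistinct(...) column expression per combination,
# aliased with the comma-joined combination name
exprs <- lapply(nms, function(cols) {
  cols_spark <- lapply(cols, function(cl) df_spark[[cl]])
  alias(do.call(countDistinct, cols_spark), paste(cols, collapse = ","))
})

# A single aggregation job evaluates every expression in one pass
counts_row <- collect(do.call(agg, c(list(df_spark), exprs)))

out <- data.frame(vars = names(counts_row),
                  counts = as.numeric(counts_row[1, ]))
```

From there, `out` can be filtered exactly as in the code above to keep the combinations whose count equals `SparkR::count(df_spark)` and, among those, the ones with the fewest variables.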

Regarding "r - Using SparkR to find the variables that form a primary key", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/53326497/
