
r - data.table grouping column is length 1 in "J"

Reposted. Author: 行者123. Updated: 2023-12-04 02:25:42

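While learning data.table, I ran into a situation I cannot resolve elegantly.

Up front: the formula passed to lm is obviously absurd. I am trying to determine whether this nuance can be resolved easily with a keyword or special operator from the data.table ecosystem.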

library(data.table)
mt <- as.data.table(mtcars)
mt[, list(model = list(lm(mpg ~ disp))), by = "cyl"]
# cyl model
# 1: 6 <lm>
# 2: 4 <lm>
# 3: 8 <lm>
mt[, list(model = list(lm(mpg ~ disp + cyl))), by = "cyl"]
# Error in model.frame.default(formula = mpg ~ disp + cyl, drop.unused.levels = TRUE) :
# variable lengths differ (found for 'cyl')

This happens because, inside the block, cyl is a vector of length 1 rather than a full-length column like the other variables:
mt[, list(model = { browser(); list(lm(mpg ~ cyl+disp)); }), by = "cyl"]
# Called from: `[.data.table`(mt, , list(model = {
# browser()
# list(lm(mpg ~ cyl + disp))
# ...
# Browse[1]>
# debug at #1: list(lm(mpg ~ cyl + disp))
# Browse[2]>
disp
# [1] 160.0 160.0 258.0 225.0 167.6 167.6 145.0
# Browse[2]>
cyl
# [1] 6
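
The same observation can be made without browser(). This is a minimal sketch, not part of the original question; the column names n_cyl and n_disp are only illustrative. It compares the length of the grouping column with the length of an ordinary column inside each group.

```r
library(data.table)
mt <- as.data.table(mtcars)

# Inside j, the grouping column is a single value per group,
# while the other columns hold one value per row of the group.
lens <- mt[, .(n_cyl = length(cyl), n_disp = length(disp)), by = "cyl"]
lens
```

n_cyl should be 1 in every group, while n_disp should equal the number of rows in that group.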

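The most direct fix seems to be lengthening it manually inside the block, either as a temporary variable or literally where it is needed: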
mt[, list(model = { cyl2 <- rep(cyl, nrow(.SD)); list(lm(mpg ~ cyl2+disp)); }), by = "cyl"]
mt[, list(model = list(lm(mpg ~ rep(cyl, nrow(.SD))+disp))), by = "cyl"]
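
A small variation on the workaround above, offered as a sketch rather than as a recommended solution: data.table's special symbol .N holds the number of rows in the current group, so it can stand in for nrow(.SD). The names cyl2 and res_n are illustrative.

```r
library(data.table)
mt <- as.data.table(mtcars)

# .N is the row count of the current group, so rep(cyl, .N)
# stretches the length-1 grouping value to the group's length.
res_n <- mt[, list(model = {
  cyl2 <- rep(cyl, .N)
  list(lm(mpg ~ cyl2 + disp))
}), by = "cyl"]
res_n
```

This is still a manual lengthening, so it does not answer the question of whether something more elegant exists.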

Q: Is there a more elegant way to handle this?

Various loosely related questions that sparked my curiosity (embedding "things" in DT objects):
  • Setting column name in "group by" operation with data.table
  • Run a function inside data.table in R
  • Using data.table to create a column of regression coefficients


So far, there are many candidates:
    mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
    mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
    mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
    mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
    mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]

    Best answer

    Thanks for all the candidates.

    mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)]
    mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
    mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)]
    mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE]
    mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
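
A quick check, not part of the original answer, of why the second candidate seems to work: because the grouping column is exposed under the new name cylgroup, cyl itself appears to remain an ordinary full-length column inside j. The names check, n_cyl and n_rows are illustrative.

```r
library(data.table)
mt <- as.data.table(mtcars)

# Group by cyl under a different name and compare the length of
# cyl inside j with the number of rows in the group (.N).
check <- mt[, .(n_cyl = length(cyl), n_rows = .N), by = .(cylgroup = cyl)]
check
```

If n_cyl matches n_rows in every group, lm receives variables of equal length and the "variable lengths differ" error no longer applies.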

    Performance (with this small model) appears to differ slightly between them:
    library(microbenchmark)
    microbenchmark(
    c1 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = mt[.I]))), by = .(cyl)],
    c2 = mt[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
    c3 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mt)],
    c4 = mt[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
    c5 = mt[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
    )
    # Unit: milliseconds
    # expr min lq mean median uq max neval
    # c1 3.7328 4.21745 4.584591 4.43485 4.57465 9.8924 100
    # c2 2.6740 3.11295 3.244856 3.21655 3.28975 5.6725 100
    # c3 2.8219 3.30150 3.618646 3.46560 3.81250 6.8010 100
    # c4 2.9084 3.27070 3.620761 3.44120 3.86935 6.3447 100
    # c5 5.6156 6.37405 6.832622 6.54625 7.03130 13.8931 100

    With larger data:
    mtbigger <- rbindlist(replicate(1000, mtcars, simplify=FALSE))
    microbenchmark(
    c1 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = mtbigger[.I]))), by = .(cyl)],
    c2 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)],
    c3 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=names(mtbigger)],
    c4 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, .SD))), by=cyl, .SDcols=TRUE],
    c5 = mtbigger[, .(model = .(lm(mpg ~ cyl + disp, data = cbind(.SD, as.data.table(.BY))))), by = "cyl"]
    )
    # Unit: milliseconds
    # expr min lq mean median uq max neval
    # c1 27.1635 30.54040 33.98210 32.2859 34.71505 76.5064 100
    # c2 23.9612 25.83105 28.97927 27.5059 30.02720 67.9793 100
    # c3 25.7880 28.27205 31.38212 30.2445 32.79030 105.4742 100
    # c4 25.6469 27.84185 30.52403 29.8286 32.60805 37.8675 100
    # c5 29.2477 32.32465 35.67090 35.0291 37.90410 68.5017 100

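    (I am guessing the relative performance scales similarly. A better verdict would probably include wider data.)

    Judging by median run time alone, the top performer (by a small margin) appears to be: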
    mtbigger[, .(model = .(lm(mpg ~ cyl + disp))), by =.(cylgroup=cyl)]
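
For completeness, a sketch of how the models stored in the list column might be used afterwards. It is not part of the original answer; it uses the small mt table rather than mtbigger, and the names fits, coefs, disp_slope and cyl_coef are illustrative.

```r
library(data.table)
mt <- as.data.table(mtcars)

# Fit one model per group with the renamed grouping column.
fits <- mt[, .(model = .(lm(mpg ~ cyl + disp))), by = .(cylgroup = cyl)]

# Pull selected coefficients out of each stored lm object.
coefs <- fits[, .(
  cylgroup,
  disp_slope = vapply(model, function(m) coef(m)[["disp"]], numeric(1)),
  cyl_coef   = vapply(model, function(m) coef(m)[["cyl"]], numeric(1))
)]
coefs
```

Since cyl is constant within each group, lm cannot estimate its coefficient and reports NA for it, which matches the question's remark that the formula is deliberately absurd.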

    Regarding "r - data.table grouping column is length 1 in "J"", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53092870/
