r - 有没有办法阻止 all.vars() 从 $ 的右侧返回名称？-6ren

r - 有没有办法阻止 all.vars() 从 $ 的右侧返回名称？

转载作者：行者123 更新时间：2023-12-04 09:00:33

base-R 函数 all.vars()返回表达式中的所有名称。例如:

> all.vars( ~ e == M * c^2  )
[1] "e" "M" "c"

有一个 R 运算符不合适。在许多不使用非标准评估和 rlang 等函数的人编写的表达式中，名称将是变量的名称。但是如果这些表达式包含对 $ 的调用，右侧的名称将不是变量，而是索引或列名称。 (我知道巧妙地使用环境和数据屏蔽会模糊变量和列名之间的区别，但这不是这里的重点。) all.vars()没有忽略 $ 右侧的选项.有没有类似的功能，或者我必须编写自己的表达式步行器？基本上，我想要一个函数，如果通过表达式

a $ b + c $ d

将返回 "a"和 "c"。
请求原因
罗兰，你能建议我解释一下我为什么要这个，真是太好了。我经常使用矢量化，因为这是在我对非常大的数据集进行的计算中获得足够速度的唯一方法。因此，我的大量代码包含以下内容:

cond <- ¢ A logical vector of 500,000 elements ¢
v1 ¢ (and v2 etc. ) ¢ <- ¢ Numerical or string vectors of the same length ¢
result_size <- length( cond )
result <- rep( NA, result_size )
result[ cond ] <- f( v1[ cond ], v3[ cond ]
                   , v4[ cond ], v7[ cond ]
                   , v9[ cond ], v10[ cond ] 
                   )
result[ ! cond ] <- g( v2[ ! cond ], v3[ ! cond ]
                     , v4[ ! cond ], v5[ ! cond ]
                     , v6[ ! cond ], v8[ ! cond ]
                     , v10[ ! cond ] 
                     )

我认为，这就是 R 专家所说的拆分工作流。按条件拆分数据，分别处理每组，合并结果。
这种模式迫切需要抽象成一些看起来像条件的东西。 (参见 RD Tennant 的书，Semantics of Programming Languages 中的抽象在这个意义上的例子以及为什么它们是好的。)所以，而不是上面容易发生事故的东西，充满了错误输入和重复的索引和部分向量分配，我希望能够写:

cond <- ¢ A logical vector of 500,000 elements ¢
v1 ¢ (and v2 etc. ) ¢ <- ¢ Numerical or string vectors of the same length ¢
result <- splivif( cond
                 , f( v1, v3, v4, v7, v9, v10 ) 
                 , g( v2, v3, v4, v5, v6, v8, v10 )
                 )

splivif()这里应该解释为另一种条件函数，就像 if() 一样。 , if_else() , ifelse()和 fifelse() ，可能还有六种我还不知道的其他人。它将以与这些函数相同的方式隐藏一些巧妙的细节:即评估其条件，根据该条件将“then”和“else”分支中提到的任何变量拆分为子向量，调用这些子向量上的每个分支，然后结合结果。
这样的功能实际上是我已经实现并经常使用的功能。如上所述，它首先评估其状况。然后它扫描变量的“then”和“else”表达式。那些它认为是正确长度的向量，它会评估。然后它只选择 cond 'th 个元素，并将它们在新环境中绑定(bind)到原始变量的名称。
所以，在这个阶段结束时，我们有一个新的婴儿环境 E其中名称“v1”绑定(bind)到 V1[ cond ] ，其中 V1是原始值 v1 .也在 E是名称“v2”到“v10”的类似绑定(bind)。 splivif()然后在 E 中计算它的“then”和“else”表达式，并将结果组合成一个结果向量。
我需要 all.vars()正在扫描变量的“then”和“else”表达式。我使用嵌套命名列表来存储控制我的计算的值。因此，表达式对这些列表的元素有大量引用，例如 Taxogellation $ IgnoreRepeats和 Taxogellation $ DoInnerSplines .我的问题的核心是 all.vars()然后会错误地返回“IgnoreRepeats”和“DoInnerSplines”作为变量的名称，而实际上它们是索引。
代码来演示为什么我需要矢量化
我于 2020 年 9 月 8 日添加了此部分，以回应 Roland 的评论。它构建了一个样本数据表，代表 50,000 个家庭的收入、年龄和健康状况。每个家庭由一名或两名成年人组成。然后它定义了一个函数， pension() ，计算每个家庭的养老金。这与任何现有政府都会提供的不同，但说明了典型养老金计算的特征。例如，结果通常取决于年龄和健康状况，并且可能取决于收入。这些相关性为任何此类计算设置了最小复杂度，因此也设置了最小时间。
代码然后比较和计时三种应用方式 pension()给所有 50,000 个家庭。它们是: 通过 Tidyverse 分组；通过 data.table 分组；和矢量化。后者使用的事实是运算符和函数，例如 + , | , >和 pmax()可以应用于多个元素的向量，然后对相应元素进行元素处理。
我的计时结果表明，与矢量化相比，Tidyverse 甚至 data.table 都慢得可怜。对于 50,000 个家庭，矢量化速度提高了 40 倍!

library( tidyverse )
library( data.table )
library( assertthat )
library( microbenchmark )
library( purrr )


#1) Create sample data
#=====================

# The code in this section makes a table
# of no_of_groups families. Each family has
# one or two adults. Adults are randomly assigned 
# an income, between 0 and 30,000 pounds;
# an age, between 18 and 99, and a health
# indicator. Each family also gets an integer
# ID. Each adult also gets a number indicating
# whether they are adult 1 or adult 2. 
#
# The sections following this will define
# a function for calculating families'
# pensions. My code will apply it in three
# ways, and time each one. These are: by
# grouping using the Tidyverse; by grouping
# using data.table; and by vectorisation.
# This shows that the Tidyverse and data.table
# are both woefully inefficient compared with
# vectorisation. For 5,000 families, the
# Tidyverse takes 2.5 seconds and data.table
# 2 seconds. Vectorisation takes a mere 50
# milliseconds, 40 times as fast.
#

no_of_groups <- 5000

group_sizes <- sample( c(1,2), no_of_groups, replace=TRUE )

ids <- 1:no_of_groups 

data <- tibble( fam_id=map2( ids, group_sizes, rep ) %>% unlist() )

data <-
  data %>%
    group_by( fam_id ) %>%
    mutate( ad_no = seq_along( fam_id )
          , two_people = length( ad_no ) == 2
          ) %>%
    ungroup() 

data $ income <- runif( nrow( data ), 0, 1 ) * 30000

data $ age <- sample( 18:99, nrow( data ), replace=TRUE )

data $ bad_health <- sample( c(T,F), nrow( data ), replace=TRUE, prob=c(0.1,0.9) )


#2) Function to calculate pension on single family
#=================================================

# two_people is true if the family has two
# people, otherwise false.
# ad1_inc and ad2_inc are the incomes, in
# pounds per year. ad2_inc is NA if there is
# only one person.
# Similarly, ad1_age and ad2_age are ages.
# And ad1_bad_health and ad2_bad_health are
# Booleans indicating whether the person
# has bad health.
# The result is the pension the Government
# gives the family, in pounds per week.
# This is NOT meant to be the same as in any
# existing country's social-security system,
# but exemplifies the kinds of calculation
# such a function needs to do. On our data,
# these will be called several hundred
# thousand times.
#
pension <- function( two_people
                   , ad1_inc, ad2_inc
                   , ad1_age, ad2_age 
                   , ad1_bad_health, ad2_bad_health
                   )
{
  max_age <- 
    ifelse( two_people
          , pmax( ad1_age, ad2_age )
          , ad1_age 
          )

  income <- 
    ifelse( two_people
          , ad1_inc + ad2_inc
          , ad1_inc
          )
 
  bad_health <-
   ifelse( two_people
         , ad1_bad_health | ad2_bad_health
         , ad1_bad_health
         )

  pension_level <-
    case_when( income > 50000 | max_age < 65 ~ "None"
             , max_age > 80 | bad_health ~ "High"
             , max_age >= 65 ~ "Normal"
             )

  pension <- 
    case_when( pension_level == "High" ~ 200.00
             , pension_level == "Normal" ~ 150.00
             , pension_level == "None" ~ 0
             )

  pension
}


#3) Check it works
#=================

pension( F, 40000, NA, 75, NA, F, NA )
# 150.

pension( T, 20000, 20000, 75, 75, F, F )
# 150.

pension( F, 60000, NA, 75, NA, F, NA )
# 0, because of high income.

pension( T, 30000, 30000, 75, 75, F, F )
# 0, because of high income.

pension( F, 60000, NA, 50, NA, F, NA )
# 0, because of low age.

pension( T, 20000, 20000, 75, 75, F, T )
# 200, because of bad health.


#4) Function to calculate all pensions using Tidyverse group-by
#==============================================================

pension_over_all_TV <- function( data )
{
  results <-
    data %>%
      group_by( fam_id ) %>%
      group_map( ~ {
                     assert_that( nrow( .x ) %in% c( 1, 2 ) )
                     two_people <- .x $ two_people[[ 1 ]]
                     pension( two_people
                            , .x $ income[[ 1 ]]
                            , ifelse( two_people, .x $ income[[ 2 ]], NA )
                            , .x $ age [[ 1 ]]
                            , ifelse( two_people, .x $ age[[ 2 ]], NA )
                            , .x $ bad_health[[ 1 ]] 
                            , ifelse( two_people, .x $ bad_health[[ 2 ]], NA )
                            )
                   }
               )
  #
  # A vector of pension values, one per family.

  results
}


#5) Try it and time it
#=====================

pensions_TV <- pension_over_all_TV( data )
#
# Pensions as calculated by Tidyverse grouping.

res <- microbenchmark( pension_over_all_TV( data ), times=3 )
print( res )
#
# Time it. Mean is 2.5 seconds:
#   Unit: seconds
#                        expr      min       lq     mean   median       uq      max neval
#   pension_over_all_TV(data) 2.533073 2.565714 2.584183 2.598356 2.609738 2.621121     3



#6) Function to calculate all pensions using data.table group-by
#===============================================================

pension_over_all_DT <- function( data )
{
  # The function that data.table must apply
  # to each group.
  #
  f <- function( group ) 
  {
    assert_that( nrow( group ) %in% c( 1, 2 ) )
    two_people <- group $ two_people[[ 1 ]]
    pension( two_people
           , group $ income[[ 1 ]]
           , ifelse( two_people, group $ income[[ 2 ]], NA )
           , group $ age [[ 1 ]]
           , ifelse( two_people, group $ age[[ 2 ]], NA )
           , group $ bad_health[[ 1 ]] 
           , ifelse( two_people, group $ bad_health[[ 2 ]], NA )
           )
  }

  data <- as.data.table( data )

  results <-
    data[
        , f( .SD )
        , by=c( "fam_id" ) 
        ]
  #
  # A table with a V1 column containing one
  # pension value per family.

  results
}


#7) Try it and time it
#=====================

pensions_DT <- pension_over_all_DT( data )
#
# Pensions as calculated by data.table grouping.

assert_that( are_equal( unlist( pensions_TV ), pensions_DT $ V1 ) )
#
# Making allowance for the slightly different 
# formats of the results returned by group_map()
# and data.table's grouped operations, check
# that the numbers are the same.

res <- microbenchmark( pension_over_all_DT( data ), times=3 )
print( res )
#
# Time it. Mean is 2 seconds:
#   Unit: seconds
#                        expr      min       lq     mean   median       uq     max neval
#   pension_over_all_DT(data) 1.824391 1.950273 2.155805 2.076154 2.321512 2.56687     3


#8) Function to calculate all pensions using vectorisation
#=========================================================

# This applies pension() to data by using vectorisation.
# It widens data into a table wherein each column is
# a vector corresponding to one of pension()'s arguments.
# It then calls exec() to apply pension() to these
# vectors. I had deliberately written pension() so that
# it would work on vector arguments with more than one
# element.
#
pension_over_all_Vect <- function( data )
{
  data_widened <-
    pivot_wider( data
               , names_from = "ad_no"
               , names_prefix = "ad"
               , values_from = all_of( c("income","age","bad_health") )
               ) %>%
    rename( ad1_inc="income_ad1", ad2_inc="income_ad2",
          , ad1_age="age_ad1", ad2_age="age_ad2"
          , ad1_bad_health="bad_health_ad1", ad2_bad_health="bad_health_ad2"
          ) %>%
    select( -fam_id )
  #
  # A table with one row for each family, and one
  # column for each of pension()'s arguments.

  results <- exec( pension, !!! as.list( data_widened ) )
  #
  # A vector of results: one pension value for
  # each family.

  results
}


#9) Try it and time it
#=====================

pensions_Vect <- pension_over_all_Vect( data )
#
# Returns a list of plausible-looking results.

assert_that( are_equal( unlist( pensions_TV ), pensions_Vect ) )
assert_that( are_equal( unlist( pensions_DT $ V1 ), pensions_Vect ) )
#
# Check that this is equal to the previously-
# calculated results.

res <- microbenchmark( pension_over_all_Vect( data ), times=3 )
print( res )
#
# Time it. The mean is 50 milliseconds.
#   Unit: milliseconds
#                          expr     min       lq    mean  median       uq     max neval
#   pension_over_all_Vect(data) 35.7834 45.23245 50.8431 54.6815 58.37295 62.0644     3

最佳答案

你应该解释为什么你需要这个，以及其他形式的非标准评估会发生什么。您的实际问题可能有更好的解决方案。
我会迅速更换 $与 [[ :

replace_dollar <- function(expr) {
  if (!is.language(expr) || length(expr) == 1L) return(expr)
  if (expr[[1]] == quote(`$`)) {
    expr[[1]] <- quote(`[[`)
    expr[[3]] <- as.character(expr[[3]])
  } else {
    for (i in seq_along(expr)[-1]) 
      expr[[i]] <- replace_dollar(expr[[i]])
  } 
  expr
}

expr <- quote(a $ b + c $ d)
replace_dollar(expr)
# a[["b"]] + c[["d"]]
all.vars(replace_dollar(expr))
#[1] "a" "c"