
r - Why does the same query using dplyr return different results in different R sessions?

Reposted. Author: 行者123. Updated: 2023-12-05 09:08:49

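While working on a project with a colleague that uses the tidyverse package dplyr to manipulate a data frame, I noticed that we got different results even though we were using the same code and the same data.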

Session info from the two R sessions:

Desktop:

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=Portuguese_Brazil.1252
[2] LC_CTYPE=Portuguese_Brazil.1252
[3] LC_MONETARY=Portuguese_Brazil.1252
[4] LC_NUMERIC=C
[5] LC_TIME=Portuguese_Brazil.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] forcats_0.4.0 stringr_1.4.0 dplyr_0.8.3
[4] purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[7] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.3.0
[10] sp_1.3-2

RStudio Cloud:

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] randomNames_1.4-0.0 plotly_4.9.2.1 lubridate_1.7.9
[4] openintro_2.0.0 usdata_0.1.0 cherryblossom_0.1.0
[7] airports_0.1.0 leaflet_2.0.3 forcats_0.5.0
[10] stringr_1.4.0 dplyr_1.0.0 purrr_0.3.4
[13] readr_1.3.1 tidyr_1.1.0 tibble_3.0.2
[16] ggplot2_3.3.2 tidyverse_1.3.0 shinydashboard_0.7.1
[19] shiny_1.5.0

Reproducible example using iris:


library(tidyverse)

#lets say that each flower on the data frame iris had a name


iris$name <-c("Jackson","al-Jalali","Tamblyn","Beckham","Knipp","Chen","el-Hares","al-Shaheen","Boyd","Gurung","Demolli","el-Omer","Christensen","Ayele","Wilson","Arriaga","el-Vaziri","Aragon","Demoudt","Gray","Raiburn","al-Aziz","Phouthavong","John","Bortolutti","Ellis","Williams","Gonzalez","Valenzuela","Smith","el-Ishak","al-Tabet","Perez","Watson","el-Imam","Kerr","Morales-Gonzale","Bell","Haines","Gutierrez","SalcidoIbarra","Jimenez","al-Bari","Gosnell","Kocsis","Pratt","Tenorio","Merriweather","Damiana","al-Jafari","Edwards","Mujkic","Lam","Russell","Christy","el-Zahra","al-Lodi","Murry","Haro","Chu","Espinoza","Sahnd","Sands","el-Nagi","Dickerson","Carlton","Flood","Tran","Cruz","Yu","West","Franklin","Dupree","Delger","White","Olivero","Sem","al-Muhammed","Shafer","Senette","Hudson","Lattimer","Lyons","Grim","Grove","Truong","LynnGoin","el-Hassan","Cline","Adams","Watkins","Littlejohn","Gatzke","Vandyke","Yocum","Ng","Ortiz","Schwartz","Torres","Hernandez","Krien","Thyfault","al-Ansari","el-Shahin","el-Hashemi","Hereford","Navajo","Bickel","Saiganesh","Polson","Bates","Griffith","Krueger","Yang","AlAmin","Linthicum","Gallegos","Murphy","Johnson","Basurto","Rendon","el-Minhas","Khan","al-Ebrahim","Macgilvray","Farrell","Ricord","Lovato","Sanchez","Palmer","Turner","al-Fares","Ball","Ji","OrtizMorales","Fan","Isaac","Barger","Eddins","Fabrizio","Hedin","Brodsky","Leggett","Le","Guichard","al-Rahim","Benefiel","Sullivan","Milender","Smith")


#and that for some reason the same flower can appear more than once in the data frame
sample_index<-c(14,50,118,43,14,118,90,91,91,92,137,99,72,26,
7,137,78,81,43,103,117,76,143,32,109,7,137,74,
23,53,135,53,34,69,72,76,63,141,97,91,38,21,
41,90,60,16,116,94,6,86,86,39,118,50,34,4,
13,69,127,52,22,89,25,35,112,30,140,121,110,64,
142,67,122,79,85,136,51,74,106,98,74,127,17,46,
54,110,94,79,24,113,107,135,102,135,5,70,16,24,
32,21)

iris_big <- rbind(iris,iris[sample_index,])

I want to know how many unique flowers there are for each species, so I wrote the following query:

 
iris_big %>%
group_by(name,Species) %>%
count() %>%
ungroup() %>%
count(Species)

The problem is that it returns two different results: one on my desktop and another on my friend's machine (he is using RStudio Cloud).

My desktop:

# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50

RStudio Cloud:


Using `n` as weighting variable
ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
# A tibble: 3 x 2
Species n
<fct> <int>
1 setosa 83
2 versicolor 80
3 virginica 87

I eventually worked around the problem with the following query:

iris_big %>% 
group_by(name,Species) %>%
count() %>%
ungroup() %>%
select(Species) %>%
group_by(Species) %>%
count()

# A tibble: 3 x 2
# Groups: Species [3]
Species n
<fct> <int>
1 setosa 50
2 versicolor 50
3 virginica 50

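But I would like to know why this happens.

Best answer

(First, I am submitting this as an alternative answer, because my first answer (about the change in sample.int between R-3.5 and R-3.6) still seems relevant to the question "why does the same query return different results in different R sessions". It is not the cause of the symptoms here, but it could easily have been, since the first version of your question used sample. Instead, the real culprit is a similarly "major" version change in dplyr.)

The behavior of dplyr::count changed significantly.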

In dplyr-0.8.3, ?count says:

      wt: (Optional) If omitted (and no variable named 'n' exists in
the data), will count the number of rows. If specified, will
perform a "weighted" tally by summing the (non-missing)
values of variable 'wt'. A column named 'n' (but not 'nn' or
'nnn') will be used as weighting variable by default in
'tally()', but not in 'count()'. This argument is
automatically quoted and later evaluated in the context of
the data frame. It supports unquoting. See
'vignette("programming")' for an introduction to these
concepts.

In dplyr-1.0.0:

      wt: <'data-masking'> Frequency weights. Can be a variable (or
combination of variables) or 'NULL'. 'wt' is computed once
for each unique combination of the counted variables.

• If a variable, 'count()' will compute 'sum(wt)' for each
unique combination.

• If 'NULL', the default, the computation depends on
whether a column of frequency counts 'n' exists in the
data frame. If it exists, the counts are computed with
'sum(n)' for each unique combination. Otherwise, 'n()' is
used to compute the counts. Supply 'wt = n()' to force
this behaviour even if you have an 'n' column in the data
frame.

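The important part is that in 0.8.3 it says "A column named 'n' ... will be used ... in 'tally()', but not in 'count()'". In 1.0.0 that wording is gone: with the default wt = NULL, count() sums an existing 'n' column instead of counting rows. I reproduced your results using R-3.5.3/dplyr-0.8.3 and R-4.0.2/dplyr-1.0.0.

There are two ways around it:

  1. Use count(..., wt = n()):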
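To make the two readings of the documentation concrete, here is a minimal sketch on a toy tibble (hypothetical data, not from the question). It deliberately avoids count() and spells out both computations with summarise(), so it should behave the same on either dplyr version: one counts rows per group, the other sums a pre-existing column named n.

```r
library(dplyr)

# Toy data (hypothetical): three rows, plus a column that happens to be named 'n'
toy <- tibble(
  Species = c("a", "a", "b"),
  n       = c(2L, 3L, 1L)
)

# Reading 1: count rows per group, ignoring the 'n' column
row_counts <- toy %>%
  group_by(Species) %>%
  summarise(rows = n())

# Reading 2: sum the existing 'n' column per group (a weighted tally)
weighted <- toy %>%
  group_by(Species) %>%
  summarise(total = sum(n))

row_counts$rows  # 2 1
weighted$total   # 5 1
```

In the question, the first count() creates exactly such an n column, so whether the second count() follows reading 1 or reading 2 decides between 50/50/50 and 83/80/87.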

    R.version$version.string
    # [1] "R version 3.5.3 (2019-03-11)"
    iris_big %>%
    group_by(name,Species) %>%
    count() %>%
    ungroup() %>%
    count(Species, wt = n())
    # # A tibble: 3 x 2
    # Species n
    # <fct> <int>
    # 1 setosa 50
    # 2 versicolor 50
    # 3 virginica 50
    R.version$version.string
    # [1] "R version 4.0.2 (2020-06-22)"
    iris_big %>%
    group_by(name,Species) %>%
    count() %>%
    ungroup() %>%
    count(Species, wt = n())
    # # A tibble: 3 x 2
    # Species n
    # <fct> <int>
    # 1 setosa 50
    # 2 versicolor 50
    # 3 virginica 50
  2. Switch to using tally on the grouped data, as in

    iris_big %>%
    group_by(name,Species) %>%
    count() %>%
    group_by(Species) %>%
    tally()
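For the underlying goal (the number of unique flowers per species), one further sketch is to drop duplicates first so that no n column exists when count() runs. This is an illustration under assumptions, not part of the original answer: the names below are synthetic placeholders and the duplicated row indices are arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for the question's data: one synthetic name per flower
flowers <- iris
flowers$name <- paste0("flower_", seq_len(nrow(flowers)))

# Append a few repeated rows (indices chosen arbitrarily for illustration)
flowers_big <- rbind(flowers, flowers[c(1, 1, 51, 150), ])

# distinct() keeps one row per (name, Species) pair and creates no 'n' column,
# so the count() that follows can only count rows
unique_per_species <- flowers_big %>%
  distinct(name, Species) %>%
  count(Species)

unique_per_species$n  # 50 50 50
```

Because the intermediate result never contains a column named n, the default weighting rule has nothing to pick up.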

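Alternatively, there is one more option:

  1. Recognize that this is issue dplyr#5298, which has been fixed in the not-yet-released dplyr-1.0.1 (I don't know the timeline). RStudio Cloud users can therefore install the GitHub version of dplyr to benefit from dplyr#5349, a PR that has already been merged. This should restore count's behavior to what it was before 1.0.0 (notwithstanding Hadley's opinion on the matter).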
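To see which side of the change a given session is on, a small check of the installed version can help. The helper is_affected() is a hypothetical name, and it assumes, based only on the statements above, that 1.0.0 is the single release with the new default.

```r
# Hypothetical helper: TRUE only for the release the answer identifies (1.0.0)
is_affected <- function(version) {
  package_version(version) == package_version("1.0.0")
}

# Compare against the dplyr installed in the current session
installed <- as.character(packageVersion("dplyr"))
if (is_affected(installed)) {
  message("dplyr ", installed, ": count() will weight by an existing 'n' column")
} else {
  message("dplyr ", installed, ": not the 1.0.0 release discussed here")
}
```

Running this on both machines before comparing results makes the version mismatch visible without reading the full sessionInfo() output.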

Source: this question ("r - Why does the same query using dplyr return different results in different R sessions?") is on Stack Overflow: https://stackoverflow.com/questions/62941250/
