
R 4.0 performance: data frames vs. lists, loops vs. vectorization (constant-vector subtraction example)


This is the third time I am reading Hadley Wickham's infamous Advanced R. In the second chapter, he explains why iterating over a list is better than iterating over a data frame. It all seemed reasonable, familiar, and expected. The example function subtracts a vector of medians from each column of a data frame, which is essentially a form of centering. To test it, I ran the code provided by Grosser, Bumann, and Wickham in Advanced R Solutions.

library(bench)

# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)

  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    time_unit = "ms"
  )
}

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#> ncol
#> 1 1
#> 2 10
#> 3 50
#> 4 100
#> 5 250
#> 6 300
#> 7 400
#> 8 500
#> 9 750
#> 10 1000

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

Again, nothing out of the ordinary; the authors explain:

When working directly with the data frame, the execution time grows quadratically with the number of columns in the input data. This is because (e.g.) the first column must be copied n times, the second column n-1 times, and so on. When working with a list, the execution time increases only linearly.

Obviously in the long run, linear growth creates shorter run-times, but there is some cost to this strategy — we have to convert between data structures with as.list() and list2DF(). Even though this is fast and probably doesn't hurt much, the improved approach doesn't really pay off in this scenario until we get to a data frame that is about 300 columns wide (with the exact value depending on the characteristics of the system running the code).
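The copying the authors describe can be made visible with base R's tracemem(), which prints a message each time an object is duplicated. Below is a minimal sketch of my own (not from the book), assuming R >= 4.0:

df  <- data.frame(a = runif(3), b = runif(3))
lst <- as.list(df)

tracemem(df)                  # report every duplication of df
for (i in seq_along(df)) {
  df[[i]] <- df[[i]] - 0.5    # each [[<- on the data frame triggers at least one copy
}

untracemem(df)
tracemem(lst)
for (i in seq_along(lst)) {
  lst[[i]] <- lst[[i]] - 0.5  # at most a cheap shallow copy of the list
}                             # spine; the column vectors themselves are reused
untracemem(lst)

As I understand it, the list version prints far fewer tracemem messages because only the spine of the list (a vector of pointers) ever needs duplicating, never the column data.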

But then I thought to myself: I use data frames all the time, not lists, and surely I can write a function that runs in acceptable time on a data frame with many columns. I was wrong! I now have more questions than when I started: aren't for loops supposed to be slower than vectorized operations? Did version 4 of R gain some new optimization I don't know about? Why is the sweep function so slow? Shouldn't matrix/vectorized operations be at least as fast as the loops? Does the garbage collector slow things down significantly? Am I doing something wrong inside these functions (I surely am in some of them)?

library(bench)

# Function to generate random dataframe
create_random_df <- function(nrow, ncol) {
  random_matrix <- matrix(runif(nrow * ncol), nrow = nrow)
  as.data.frame(random_matrix)
}

# For loop function that runs on dataframe
subtract_df <- function(x, medians) {
  for (i in seq_along(medians)) {
    x[[i]] <- x[[i]] - medians[[i]]
  }
  x
}

# Same for loop function but that runs on list
subtract_list <- function(x, medians) {
  x <- as.list(x)
  x <- subtract_df(x, medians)
  list2DF(x)
}

# 5 functions that I thought should be decently fast
my_mapply <- function(x, medians) {
  as.data.frame(
    mapply(
      function(x, medians) x - medians,
      x,
      medians
    )
  )
}

my_sweep <- function(x, medians) {
  sweep(x, 2, medians)
}

my_apply <- function(x, medians) {
  x <- t(apply(x, 1, function(a) a - medians))
  as.data.frame(x)
}

my_vectorized <- function(x, medians) {
  x <- as.matrix(x)
  x <- t(t(x) - medians)
  x <- as.data.frame(x)
  x
}

my_vectorized2 <- function(x, medians) {
  ones <- rep(1, nrow(x))
  x_median <- ones %*% t(medians)
  x - x_median
}


benchmark_medians <- function(ncol) {
  df <- create_random_df(nrow = 1e4, ncol = ncol)
  medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)

  bench::mark(
    "data frame" = subtract_df(df, medians),
    "list" = subtract_list(df, medians),
    "my_mapply" = my_mapply(df, medians),
    "my_sweep" = my_sweep(df, medians),
    # "my_apply" = my_apply(df, medians), # excluded because it is very, very slow compared with the others
    "my_vectorized" = my_vectorized(df, medians),
    "my_vectorized2" = my_vectorized2(df, medians),
    time_unit = "ms"
  )
}

# Have a quick check on dataframe with only 3 columns
benchmark_medians(3)
#> # A tibble: 6 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 data frame 0.0463 0.0884 12769. 4.44MB 55.9
#> 2 list 0.0640 0.0693 12944. 486.86KB 127.
#> 3 my_mapply 0.152 0.160 5772. 1.05MB 124.
#> 4 my_sweep 1.14 1.18 809. 2.21MB 38.7
#> 5 my_vectorized 0.208 0.212 4253. 1.1MB 89.2
#> 6 my_vectorized2 0.844 0.875 1074. 1.93MB 46.4

# Results
results <- bench::press(
  ncol = c(1, 10, 50, 100, 250, 300, 400, 500, 750, 1000),
  benchmark_medians(ncol)
)
#> Running with:
#> ncol
#> 1 1
#> 2 10
#> 3 50
#> 4 100
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 5 250
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 6 300
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 7 400
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 8 500
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 9 750
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> 10 1000
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.

library(ggplot2)

ggplot(
  results,
  aes(ncol, median, col = attr(expression, "description"))
) +
  geom_point(size = 2) +
  geom_smooth() +
  labs(
    x = "Number of Columns",
    y = "Execution Time (ms)",
    colour = "Data Structure"
  ) +
  theme(legend.position = "top")
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Created on 2021-08-08 by the reprex package (v2.0.1)

These results have been remarkably consistent across all my tests, regardless of the characteristics of the generated data. I'd appreciate any insight, as well as any advice on performance or on the use of particular functions or function families.

Thank you.

Best Answer

Aren't for loops supposed to be slower than vectorized operations? / Shouldn't matrix/vectorized operations be at least as fast as the loops?

It depends. However, I think there is a misunderstanding about vectorized operations (and data.frames) here. Consider my_vectorized2():

my_vectorized2 <- function(x, medians) {
  ones <- rep(1, nrow(x))
  x_median <- ones %*% t(medians)
  x - x_median
}

The "bottleneck" here is the fourth line. Contrary to what you assume (I presume), x - x_median does not trigger a vectorized computation: x is not a matrix but a data frame (i.e., a list). Arithmetic on data.frames only works because convenience methods have been implemented for them. As the example below shows, this should not be taken for granted.

# `+` method for data.frames
as.data.frame(1:5) + as.data.frame(6:10)
#>   1:5
#> 1   7
#> 2   9
#> 3  11
#> 4  13
#> 5  15

# no `+` method for lists!
as.list(1:5) + as.list(6:10)
#> Error in as.list(1:5) + as.list(6:10) :
#>   non-numeric argument to binary operator

In short, however, these convenience methods for data.frames are less efficient than vector-based computations. By profiling the code, we can take a closer look at where the time is spent:

df <- create_random_df(nrow = 1e4, ncol = 1000)
medians <- vapply(df, median, numeric(1), USE.NAMES = FALSE)
profvis::profvis(my_vectorized2(df, medians))

[profvis flame graph for my_vectorized2(); most of the time is attributed to Ops.data.frame()]

By far the most time is spent in Ops.data.frame(), which apparently splits the data frame into its columns for the "-" method.
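To see how much this convenience costs, here is a small comparison of my own (not part of the original answer): the identical element-wise subtraction, once with data.frame operands and once with matrix operands, where `-` goes straight to R's internal vectorized arithmetic:

# Sketch: same subtraction, data.frame dispatch vs. plain matrix arithmetic
df2 <- create_random_df(nrow = 1e4, ncol = 1000)
m2  <- as.matrix(df2)
med_mat <- matrix(vapply(df2, median, numeric(1), USE.NAMES = FALSE),
                  nrow = nrow(df2), ncol = ncol(df2), byrow = TRUE)

bench::mark(
  "data.frame" = df2 - as.data.frame(med_mat),
  "matrix"     = m2 - med_mat,
  check = FALSE  # results hold the same values but differ in class
)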

why is the sweep function so slow? / does the garbage collector slow things down significantly?

I'm not too familiar with the inner workings of sweep(), but its poor performance is mainly due to the issue discussed above. In addition, several references are created along the way, which results in copies. Garbage collection also seems to claim a considerable share of the time (this is confirmed by profiling). The same goes for your approaches with mapply() and apply(). my_vectorized() is relatively fast because the computation is vectorized; however, the conversions to and from a data.frame, and the transpositions with t(), are expensive (they cause copies) and should be avoided.
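If you want the speed of my_vectorized() without the two transposes, one alternative (my own sketch; the name my_vectorized3 is made up) is to recycle the medians down the columns with rep(), so medians[j] is subtracted from column j without ever calling t():

my_vectorized3 <- function(x, medians) {
  m <- as.matrix(x)
  # rep(..., each = nrow(m)) lines the medians up with m's column-major
  # storage, so no transpose (and none of t()'s full copies) is needed
  m <- m - rep(medians, each = nrow(m))
  as.data.frame(m)
}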

All in all, one piece of advice: when you strive for performance in arithmetic operations, generally work with matrices instead of data.frames.
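For example (again a sketch of mine): if the data can stay a matrix for the whole pipeline, the centering collapses to a single vectorized expression with no structure conversions in the hot path:

mat  <- matrix(runif(1e4 * 100), nrow = 1e4)
meds <- apply(mat, 2, median)  # or matrixStats::colMedians(mat), if installed

centered <- mat - rep(meds, each = nrow(mat))  # one vectorized pass, no t() copies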

The original question, "R 4.0 performance: data frames vs. lists, loops vs. vectorization (constant-vector subtraction example)", can be found on Stack Overflow: https://stackoverflow.com/questions/68701726/
