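I have a data frame in which one column is the index of a shopping basket. For each basket, another column identifies the items in that basket. What is the most efficient way to find the unique baskets in the data set?
Here is an example using dplyr: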
library(dplyr)

outer_num <- 10000
# 8 * outer_num baskets, two rows (items) per basket
tmp_df <-
  data.frame(basket_index = rep(1:(8 * outer_num), each = 2),
             items_purchased = rep(rep(c(1, 1, 2, 2, 1, 1, 3, 3), 2), outer_num))
# lookup table from item code to item name
items_purchased_df <-
  data.frame(items_purchased = 1:3,
             item_name = c("shampoo", "soap", "conditioner"))
# replace the item codes with item names
tmp_df_2 <-
  tmp_df %>%
  inner_join(items_purchased_df, by = "items_purchased") %>%
  select(basket_index, items_purchased = item_name)
head(tmp_df_2, 16)
# basket_index items_purchased
# 1 1 shampoo
# 2 1 shampoo
# 3 2 soap
# 4 2 soap
# 5 3 shampoo
# 6 3 shampoo
# 7 4 conditioner
# 8 4 conditioner
# 9 5 shampoo
# 10 5 shampoo
# 11 6 soap
# 12 6 soap
# 13 7 shampoo
# 14 7 shampoo
# 15 8 conditioner
# 16 8 conditioner
tmp_fn <- function(tmp_df) {
  tmp_df %>%
    group_by(basket_index) %>%
    mutate(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
    group_by(collapsed_purchases) %>%
    filter(basket_index == min(basket_index)) %>%
    ungroup()
}
tmp_fn(tmp_df_2)
# basket_index items_purchased collapsed_purchases
# <int> <fct> <chr>
# 1 1 shampoo shampoo,shampoo
# 2 1 shampoo shampoo,shampoo
# 3 2 soap soap,soap
# 4 2 soap soap,soap
# 5 4 conditioner conditioner,conditioner
# 6 4 conditioner conditioner,conditioner
# same data, with the item names re-encoded as integer codes
tmp_df_3 <-
  tmp_df_2 %>%
  mutate(items_purchased_old = items_purchased,
         items_purchased = as.integer(factor(items_purchased)))
microbenchmark::microbenchmark(tmp_fn(tmp_df_2), times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# tmp_fn(tmp_df_2) 20.6301 20.93541 21.98261 22.24193 22.43473 23.77921 10
microbenchmark::microbenchmark(tmp_fn(tmp_df_3), times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# tmp_fn(tmp_df_3) 348.3901 358.0814 507.7983 363.7639 387.2384 1566.903 10
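Best answer
Update: my results were obtained with stringsAsFactors = F. Without it, there is no significant performance gain over the OP's tmp_fn() function.
As far as I know, group_by + mutate and group_by + filter are slow. Here is an approach that avoids them: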
# for outer_num <- 10000
system.time(
  res <- tmp_df_2 %>%
    group_by(basket_index) %>%
    summarize(collapsed_purchases = paste0(items_purchased, collapse = ',')) %>%
    filter(!duplicated(collapsed_purchases))
  # summarize drops one (in this case, the only) grouping level
  # so filter is on ungrouped data which is good; also duplicated() is fast enough
)
# user system elapsed
# 4.35 0.00 4.41
res
# A tibble: 3 x 2
# basket_index collapsed_purchases
# <int> <chr>
# 1 1 shampoo,shampoo
# 2 2 soap,soap
# 3 4 conditioner,conditioner
# get desired result
tmp_df_2 %>%
  inner_join(res, by = "basket_index")
# basket_index items_purchased collapsed_purchases
# 1 1 shampoo shampoo,shampoo
# 2 1 shampoo shampoo,shampoo
# 3 2 soap soap,soap
# 4 2 soap soap,soap
# 5 4 conditioner conditioner,conditioner
# 6 4 conditioner conditioner,conditioner
data.table may be faster still.
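As a rough illustration of that suggestion, here is a minimal sketch of the same collapse-then-deduplicate idea written with data.table. It is not part of the original answer and has not been benchmarked here; the names baskets, collapsed, unique_baskets and result are made up for this example, and the small table only stands in for tmp_df_2.

```r
library(data.table)

# Hypothetical stand-in for tmp_df_2: 8 baskets, two identical items each
baskets <- data.table(
  basket_index    = rep(1:8, each = 2),
  items_purchased = rep(c("shampoo", "soap", "shampoo", "conditioner"),
                        each = 2, times = 2)
)

# One row per basket, with its items collapsed into a single string key
collapsed <- baskets[
  , .(collapsed_purchases = paste(items_purchased, collapse = ",")),
  by = basket_index
]

# Keep only the first basket seen for each distinct key
unique_baskets <- unique(collapsed, by = "collapsed_purchases")

# Join back to recover the item rows of the retained baskets
result <- baskets[unique_baskets, on = "basket_index"]
print(result)
```

As in the dplyr version, the deduplication runs on one row per basket rather than on grouped item rows. Whether this is actually faster depends on the data and the data.table version, so it is worth timing with microbenchmark on the real input.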
For "r - Efficiently finding unique subset groups (e.g. unique shopping baskets)", the original question is on Stack Overflow: https://stackoverflow.com/questions/57618540/