R data.table: subgroup weighted percent of group

Reposted by: 行者123. Updated: 2023-12-04 10:05:21

I have a data.table like this:

library(data.table)
widgets <- data.table(serial_no=1:100, 
                      color=rep_len(c("red","green","blue","black"),length.out=100),
                      style=rep_len(c("round","pointy","flat"),length.out=100),
                      weight=rep_len(1:5,length.out=100) )

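Although I'm not sure this is the most data.table-like way, I can use table and length to compute subgroup frequencies by group in a single step, for example to answer the question "What percent of red widgets are round?"

EDIT: this code does not give the correct answer.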

# example A
widgets[, list(style = unique(style), 
               style_pct_of_color_by_count = 
                 as.numeric(table(style)/length(style)) ), by=color]

#    color  style style_pct_of_color_by_count
# 1:   red  round                        0.32
# 2:   red pointy                        0.32
# 3:   red   flat                        0.36
# 4: green pointy                        0.32
# ...
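
Why example A mislabels its rows: unique(style) returns the styles in order of first appearance, while table(style) returns counts in sorted order, so the two vectors need not line up within a group. Below is a minimal sketch of the mismatch and one possible fix that takes labels and counts from the same table object. The names red_styles and fixed_a are made up for illustration.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# Styles of the red widgets only
red_styles <- widgets[color == "red", style]

unique(red_styles)        # order of first appearance
names(table(red_styles))  # sorted order, which can differ from the line above

# Possible fix (an assumption, not from the original post):
# read both the labels and the counts from the same table object
fixed_a <- widgets[, {
  tab <- table(style)
  list(style = names(tab),
       style_pct_of_color_by_count = as.numeric(tab) / .N)
}, by = color]

fixed_a[color == "red"]
```

With this data, the red rows from fixed_a should agree with the counts reported in the answer (round is the most common style among red widgets).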

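However, I can't use this approach to answer questions like "By weight, what percent of red widgets are round?" I can only come up with a two-step approach: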

# example B
widgets[, list(cs_weight = sum(weight)), by = list(color, style)][
  , list(style, style_pct_of_color_by_weight = cs_weight / sum(cs_weight)), by = color]

#    color  style style_pct_of_color_by_weight
# 1:   red  round                    0.3466667
# 2:   red pointy                    0.3466667
# 3:   red   flat                    0.3066667
# 4: green pointy                    0.3333333
# ...

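I'm looking for a single-step approach to B, and to A if it can be improved, with an explanation that deepens my understanding of data.table syntax for by-group operations. Note that this question differs from "Weighted sum of variables by groups with data.table" because mine involves subgroups and avoiding multiple steps. TYVM.

Best answer

This is almost a single step: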

# A
widgets[,{
    totwt = .N
    .SD[,.(frac=.N/totwt),by=style]
},by=color]
    # color  style frac
 # 1:   red  round 0.36
 # 2:   red pointy 0.32
 # 3:   red   flat 0.32
 # 4: green pointy 0.36
 # 5: green   flat 0.32
 # 6: green  round 0.32
 # 7:  blue   flat 0.36
 # 8:  blue  round 0.32
 # 9:  blue pointy 0.32
# 10: black  round 0.36
# 11: black pointy 0.32
# 12: black   flat 0.32

# B
widgets[,{
    totwt = sum(weight)
    .SD[,.(frac=sum(weight)/totwt),by=style]
},by=color]
 #    color  style      frac
 # 1:   red  round 0.3466667
 # 2:   red pointy 0.3466667
 # 3:   red   flat 0.3066667
 # 4: green pointy 0.3333333
 # 5: green   flat 0.3200000
 # 6: green  round 0.3466667
 # 7:  blue   flat 0.3866667
 # 8:  blue  round 0.2933333
 # 9:  blue pointy 0.3200000
# 10: black  round 0.3733333
# 11: black pointy 0.3333333
# 12: black   flat 0.2933333

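How it works: construct the denominator for the top-level group (color) before moving to the finer group (color with style) to tabulate.

Alternatives. If the styles repeat within each color and this is only for display purposes, try a table: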
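
As a quick sanity check, the nested form can be compared against the two-step form from the question. This is only a sketch: the names nested and two_step are made up for illustration, and both results are sorted before comparison because row order is not guaranteed to match in general.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# Nested form: the denominator is computed once per color,
# then reused inside the per-style aggregation of .SD
nested <- widgets[, {
  totwt <- sum(weight)
  .SD[, .(frac = sum(weight) / totwt), by = style]
}, by = color]

# Two-step form: aggregate to (color, style), then normalise within color
two_step <- widgets[, .(cs_weight = sum(weight)), by = .(color, style)][
  , .(style, frac = cs_weight / sum(cs_weight)), by = color]

# Sort both the same way so the comparison ignores row order
setorder(nested, color, style)
setorder(two_step, color, style)

all.equal(nested, two_step)      # expected to be TRUE if the two forms agree
nested[, .(total = sum(frac)), by = color]  # each color's fractions should sum to 1
```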

# A
widgets[,
  prop.table(table(color,style),1)
]
#        style
# color   flat pointy round
#   black 0.32   0.32  0.36
#   blue  0.36   0.32  0.32
#   green 0.32   0.36  0.32
#   red   0.32   0.32  0.36

# B
widgets[,rep(1L,sum(weight)),by=.(color,style)][,
  prop.table(table(color,style),1)
]

#        style
# color        flat    pointy     round
#   black 0.2933333 0.3333333 0.3733333
#   blue  0.3866667 0.3200000 0.2933333
#   green 0.3200000 0.3333333 0.3466667
#   red   0.3066667 0.3466667 0.3466667

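For B, this expands the data so that there is one observation per unit of weight. With large data, such an expansion would be a bad idea because it consumes too much memory. Also, weight must be an integer; otherwise, its sum is silently truncated to a whole number (for example, rep(1, 2.5) returns [1] 1 1).

Source: a similar question on Stack Overflow about "R data.table: subgroup weighted percent of group": https://stackoverflow.com/questions/30944116/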
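
One way to get the same weighted display table without expanding the data is base R's xtabs, which sums the left-hand side of the formula within each cell. This is not part of the original answer, only a sketch of an alternative; the names weighted_tab, pct_tab, and half_weights are made up for illustration.

```r
library(data.table)

widgets <- data.table(serial_no = 1:100,
                      color  = rep_len(c("red", "green", "blue", "black"), length.out = 100),
                      style  = rep_len(c("round", "pointy", "flat"), length.out = 100),
                      weight = rep_len(1:5, length.out = 100))

# Sum of weight in every color x style cell, with no row expansion
weighted_tab <- xtabs(weight ~ color + style, data = widgets)

# Row-wise proportions: share of each color's total weight by style
pct_tab <- prop.table(weighted_tab, margin = 1)
pct_tab

# Non-integer weights are summed as-is rather than truncated
half_weights <- copy(widgets)[, weight := weight * 0.5]
prop.table(xtabs(weight ~ color + style, data = half_weights), margin = 1)
```

Because scaling every weight by the same factor leaves the proportions unchanged, the second table should match the first.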
