gpt4 book ai didi

SQL统计采样

转载 作者:行者123 更新时间:2023-12-03 00:25:19 29 4
gpt4 key购买 nike

我正在寻找一些天才的 SQL 帮助来解决我遇到的一个棘手的统计问题。

我想要做的是从不平衡的用户配置文件组中提取统计上平衡的样本。一次针对单个个人资料属性(例如性别)执行此操作会有些简单。但同时跨多个维度进行这项工作需要一定的复杂性。

为了便于讨论,假设我有这张 table 。

Profile.userID  
Profile.Gender
Profile.Age
Profile.Income

如果我想从混合中提取一组配置文件,以便新的用户样本大致匹配以下所有特征:

50% male, 50% female
30% young, 40% middle age, 40% old
40% low income, 40% middle income, 20% high income

有人对如何实现这一目标有任何想法吗?

最佳答案

您遇到的是采样问题。解决这个问题的关键是将数据分成三个变量组合的单独组。然后,计算每组边际概率的乘积(您的值是边际概率)。然后,对所有 18 个组进行标准化。

例如,Male-Young-Low 组的值为 0.5*0.3*0.4 = 0.06。您对所有 18 个组重复此操作,然后标准化为百分比(即将每个值除以所有值的总和)。结果如下:

Gender  Age     Income  Marg    Normalized
Male Young Low 0.06 5.5%
Male Young Middle 0.06 5.5%
Male Young High 0.03 2.7%
Male Middle Low 0.08 7.3%
Male Middle Middle 0.08 7.3%
Male Middle High 0.04 3.6%
Male Old Low 0.08 7.3%
Male Old Middle 0.08 7.3%
Male Old High 0.04 3.6%
Female Young Low 0.06 5.5%
Female Young Middle 0.06 5.5%
Female Young High 0.03 2.7%
Female Middle Low 0.08 7.3%
Female Middle Middle 0.08 7.3%
Female Middle High 0.04 3.6%
Female Old Low 0.08 7.3%
Female Old Middle 0.08 7.3%
Female Old High 0.04 3.6%

这将成为每个组的采样率。下面是实际进行采样的伪 SQL 代码:

with SamplingRates (
select 'Male' as gender, 'Young' as Age, 'Low' as income, 0.045 as SamplingRate,
union all . .
)
select t.*
from (select t.*,
row_number() over (partition by gender, age, income order by <random>) as seqnum,
count(*) over (partition by gender, age, income) as NumRecs
from table t
) t join
SampleRates sr
on t.gender = sr.gender and t.age = sr.age and t.income = sr.income and
seqnum <= sr.SamplingRate * NumRecs

关于SQL统计采样,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/12939608/

29 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com