gpt4 book ai didi

hadoop - 在 Hive 中按列采样

转载 作者:可可西里 更新时间:2023-11-01 15:27:13 33 4
gpt4 key购买 nike

给定一个 Hive 表如下:

> desc T;
dim1 string
dim2 string
dim3 string
value1 int
value2 int

我尝试按组 (dim1, dim2, dim3) 随机抽样 1,000 行。

一种方法是:

# bash
for dim1 in dim1_1, dim1_2; do
for dim2 in dim2_1, dim2_2; do
for dim3 in dim3_1, dim3_2; do
hive -e "select * from T where dim1=$dim1 and dim2=$dim2 and dim3=$dim3 limit 1000;"
done done done

然后会依次执行2^3=8个查询。有没有更有效的方法?

最佳答案

with    dim as
(
select struct(d1.v,d2.v,d3.v) as vals

from (select 1) x
lateral view explode (array(1,2)) d1 as v -- dim1_1 = 1 dim1_2 = 2
lateral view explode (array(3,4)) d2 as v -- dim2_1 = 3 dim2_2 = 4
lateral view explode (array(5,6)) d3 as v -- dim3_1 = 5 dim3_2 = 6
)

select *

from (select row_number () over
(
partition by dim1,dim2,dim3
order by rand()
) as rn
,*

from t

where struct(t.dim1,t.dim2,t.dim3) in (select vals from dim)
) t

where rn <= 1000
;

关于hadoop - 在 Hive 中按列采样,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42711389/

33 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com