
hadoop - How to get the start and end events from a table

Reposted. Author: 可可西里. Updated: 2023-11-01 16:57:02

I have records like the following in a table:

session_id sequence timestamp
1 1 298349
1 2 299234
1 3 234255
2 1 153523
2 2 234524
3 1 123434

I want to get the following result:

session_id  start       end
1 298349 234255
2 153523 234524
3 123434 123434

How can I do this in Pig?

Best answer

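-- Register the DataFu jar; $piglib is a parameter pointing at its directory
register 'file:$piglib/datafu-1.2.0.jar';

define FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

-- Tab-separated input: session_id, sequence, time
input_data = load 'so.txt' using PigStorage('\t') as (session_id:int, sequence:int, time:long);

g = group input_data by session_id;

r = foreach g {
    -- Sort each session's bag both ways so the first tuple is the start or the end event
    s1 = order input_data by sequence asc;
    s2 = order input_data by sequence desc;
    -- The second argument to FirstTupleFromBag is the default returned for an empty bag
    generate group as session_id, FirstTupleFromBag(s1, null).time as start, FirstTupleFromBag(s2, null).time as end;
}

dump r;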

First group by session_id, then sort each group by sequence in ascending and in descending order, and take the first tuple of each sorted bag.

This uses the DataFu UDF library ( http://datafu.incubator.apache.org/docs/datafu/1.2.0/datafu/pig/bags/FirstTupleFromBag.html )
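
To make the group-then-pick-first logic easier to check by hand, here is a minimal Python sketch that applies the same idea to the sample rows from the question. It is not part of the original answer and does not use Pig or DataFu; the function name start_end_per_session is made up for illustration, and min/max by sequence stand in for the two sorts followed by taking the first tuple.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question: (session_id, sequence, timestamp)
rows = [
    (1, 1, 298349),
    (1, 2, 299234),
    (1, 3, 234255),
    (2, 1, 153523),
    (2, 2, 234524),
    (3, 1, 123434),
]


def start_end_per_session(records):
    """Return (session_id, start, end) per session, ordered by session_id.

    Hypothetical helper for illustration only.
    """
    result = []
    # groupby only merges adjacent rows, so sort by session_id first
    ordered = sorted(records, key=itemgetter(0))
    for session_id, group in groupby(ordered, key=itemgetter(0)):
        events = list(group)
        first = min(events, key=itemgetter(1))  # row with the lowest sequence
        last = max(events, key=itemgetter(1))   # row with the highest sequence
        result.append((session_id, first[2], last[2]))
    return result


for row in start_end_per_session(rows):
    print(*row)
# prints:
# 1 298349 234255
# 2 153523 234524
# 3 123434 123434
```

Note that start and end are chosen by sequence, not by the timestamp value, which is why session 1 ends at 234255 even though that is smaller than its start, exactly as in the expected result in the question.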

Regarding "hadoop - How to get the start and end events from a table", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/28648119/
