
hadoop - How to unnest bags from a complicated data structure in Pig

Reposted. Author: 可可西里. Updated: 2023-11-01 16:37:43

Originally I have a structure like this:

+-------+-------+----+----+----+-----+
| time | type | s1 | s2 | id | p1 |
+-------+-------+----+----+----+-----+
| 10:30 | send | a | b | 1 | 110 |
| 10:35 | send | c | d | 1 | 120 |
| 10:31 | reply | e | f | 3 | 221 |
| 10:33 | reply | a | c | 1 | 210 |
| 10:34 | send | a | a | 3 | 113 |
| 10:32 | reply | c | d | 3 | 157 |
+-------+-------+----+----+----+-----+
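
For anyone who wants to reproduce this, a relation with that shape could be loaded roughly as follows. This is only a sketch: the file name events.tsv, the tab delimiter, and the field types are assumptions and do not come from the original question.

```pig
-- Hypothetical input file and delimiter; adjust both to the real data source.
-- time is kept as a chararray: zero-padded HH:mm strings sort chronologically.
events = LOAD 'events.tsv' USING PigStorage('\t')
    AS (time:chararray, type:chararray, s1:chararray, s2:chararray, id:int, p1:int);
```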

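I want to normalize the table:

  1. Group the entries by id,
  2. Within each group, find the earliest entry of type send,
  3. Replace s1 and s2 of the other entries with the values from that earliest send entry.

The expected result: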

+-------+-------+----+----+----+-----+
| time | type | s1 | s2 | id | p1 |
+-------+-------+----+----+----+-----+
| 10:30 | send | a | b | 1 | 110 |
| 10:35 | send | a | b | 1 | 120 |
| 10:33 | reply | a | b | 1 | 210 |
| 10:31 | reply | a | a | 3 | 221 |
| 10:34 | send | a | a | 3 | 113 |
| 10:32 | reply | a | a | 3 | 157 |
+-------+-------+----+----+----+-----+

This is how I tried to solve the problem:

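events_groupby_id = GROUP events BY id;
events_normalized = FOREACH events_groupby_id {
    -- keep only the 'send' entries of this group
    f_reqs = FILTER events BY type matches 'send';
    -- sort the filtered entries (not all events) so the earliest send comes first
    o_reqs = ORDER f_reqs BY time ASC;
    req = LIMIT o_reqs 1;
    GENERATE req, events;
};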

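I am stuck here. events_normalized has become a complicated structure of nested bags, and I do not know how to flatten it properly.

events_normalized | req: bag{:tuple()} | events: bag{:tuple()}

From here, what should I do to get the data structure I want? I would appreciate it if anyone could help. Thank you.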

Best answer

You can use FLATTEN to unnest the bags in events_normalized:

events_flattened = FOREACH events_normalized GENERATE 
FLATTEN(req),
FLATTEN(events);

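This creates a cross product between req and events, but since req contains only one tuple, you end up with exactly one record for each of your original entries. Note that FLATTEN of an empty bag produces no output, so a group whose req is empty would be dropped entirely. The schema of events_flattened is: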

req::time | req::type | req::s1 | req::s2 | req::id | req::p1 | events::time | events::type | events::s1 | events::s2 | events::id | events::p1

So now you can reference the fields you want to keep, using events for the original entry and req for the replacement values from the earliest send entry:

final = FOREACH events_flattened GENERATE 
events::time AS time,
events::type AS type,
req::s1 AS s1,
req::s2 AS s2,
events::id AS id,
events::p1 AS p1;
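
To check the result, the schemas can be inspected with DESCRIBE and the rows printed with DUMP. A small sketch, assuming the aliases above are all defined in the same script; final_sorted is a name introduced here for illustration, and the ORDER is there only to make the output easier to compare with the expected table.

```pig
-- Print the schema before and after flattening.
DESCRIBE events_normalized;  -- two bag fields: req and events
DESCRIBE events_flattened;   -- flat fields, prefixed req:: and events::

-- Sort for readability only; Pig does not otherwise guarantee row order.
final_sorted = ORDER final BY id ASC, time ASC;
DUMP final_sorted;
```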

Regarding "hadoop - How to unnest bags from a complicated data structure in Pig", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/48536448/
