
hadoop - Pig heap on about 185 gigs (202260828 records)

Reposted. Author: 行者123. Updated: 2023-12-02 21:51:01

I'm still fairly new to Pig, but I understand the basic concepts of map/reduce jobs. I'm trying to work out some per-user statistics from some simple logs. We have a utility that parses the fields out of the logs, and I'm using DataFu to compute the variance and quartiles.

My script is as follows:

log = LOAD '$data' USING SieveLoader('node', 'uid', 'long_timestamp');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL;
--Find all users
SPLIT log_map INTO cloud IF $0#'node' MATCHES '.*mis01.*', dev OTHERWISE;
--For the real cloud
cloud = FOREACH cloud GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'cloud' AS domain, '192.168.0.231' AS ldap_server;
dev = FOREACH dev GENERATE $0#'uid' AS uid, $0#'long_timestamp' AS long_timestamp:long, 'dev' AS domain, '10.0.0.231' AS ldap_server;
modified_logs = UNION dev, cloud;

--Calculate user times
user_times = FOREACH modified_logs GENERATE *, ToDate((long)long_timestamp) as date;
--Based on weekday/weekend
aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetWeekOrWeekend(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;
--Based on actual day of week
--aliased_user_times = FOREACH user_times GENERATE *, GetYear(date) AS year:int, GetMonth(date) AS month:int, GetDay(date) AS day:int, GetDayOfWeek(date) AS day_of_week, long_timestamp % (24*60*60*1000) AS miliseconds_into_day;

user_days = GROUP aliased_user_times BY (uid, ldap_server,domain, year, month, day, day_of_week);

some_times_by_day = FOREACH user_days GENERATE FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week), MAX(aliased_user_times.miliseconds_into_day) AS max, MIN(aliased_user_times.miliseconds_into_day) AS min;

times_by_day = FOREACH some_times_by_day GENERATE *, max-min AS time_on;

times_by_day_of_week = GROUP times_by_day BY (uid, ldap_server, domain, day_of_week);
STORE times_by_day_of_week INTO '/data/times_by_day_of_week';

--New calculation, mean, var, std_d, (min, 25th quartile, 50th (aka median), 75th quartile, max)
averages = FOREACH times_by_day_of_week GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week), 'USER' as type, AVG(times_by_day.min) AS start_avg, VAR(times_by_day.min) AS start_var, SQRT(VAR(times_by_day.min)) AS start_std, Quartile(times_by_day.min) AS start_quartiles;
--AVG(times_by_day.max) AS end_avg, VAR(times_by_day.max) AS end_var, SQRT(VAR(times_by_day.max)) AS end_std, Quartile(times_by_day.max) AS end_quartiles, AVG(times_by_day.time_on) AS hours_avg, VAR(times_by_day.time_on) AS hours_var, SQRT(VAR(times_by_day.time_on)) AS hours_std, Quartile(times_by_day.time_on) AS hours_quartiles ;

STORE averages INTO '/data/averages';

I've seen other people run into problems when DataFu computes several quantiles at once, so I only compute one at a time. The custom loader loads one line at a time and converts it to a map via the utility, and there is a small UDF that checks whether a date falls on a weekday or a weekend (originally we wanted statistics per day of the week, but loading enough data to get interesting quartiles was killing the map/reduce tasks).
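Computing one quantile at a time can be expressed with separate UDF instances. The sketch below assumes DataFu's `datafu.pig.stats.Quantile`, which takes the desired cut points as constructor arguments and expects a sorted input bag; the `P25`/`P50`/`P75` names and the output aliases are hypothetical, while the relation names reuse the script above:

```
-- Hypothetical DEFINEs: one Quantile instance per cut point, rather
-- than a single instance producing all of them at once.
DEFINE P25 datafu.pig.stats.Quantile('0.25');
DEFINE P50 datafu.pig.stats.Quantile('0.5');
DEFINE P75 datafu.pig.stats.Quantile('0.75');

-- Quantile expects its input bag to be sorted, so sort inside a
-- nested FOREACH before applying each UDF.
start_quartiles = FOREACH times_by_day_of_week {
    sorted = ORDER times_by_day BY min;
    GENERATE FLATTEN(group) AS (uid, ldap_server, domain, day_of_week),
             P25(sorted.min) AS start_p25,
             P50(sorted.min) AS start_p50,
             P75(sorted.min) AS start_p75;
};
```

DataFu also ships `datafu.pig.stats.StreamingQuantile`, an approximate variant that does not require the bag to be sorted, which can help when the sorted bags themselves are too large.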

Using Pig 0.11.

Best Answer

It looks like my particular problem was caused by trying to compute both the minimum and the maximum in a single Pig Latin statement. Splitting the work into two separate commands and then joining the results seems to have resolved my memory problem.
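A sketch of that split, reusing the aliases from the question; the join keys and the final projection are assumptions:

```
-- Compute MAX and MIN in two separate statements, so neither pass has
-- to carry both aggregates, then join the results back together.
max_by_day = FOREACH user_days GENERATE
    FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week),
    MAX(aliased_user_times.miliseconds_into_day) AS max;
min_by_day = FOREACH user_days GENERATE
    FLATTEN(group) AS (uid, ldap_server, domain, year, month, day, day_of_week),
    MIN(aliased_user_times.miliseconds_into_day) AS min;

joined = JOIN max_by_day BY (uid, year, month, day),
         min_by_day BY (uid, year, month, day);

times_by_day = FOREACH joined GENERATE
    max_by_day::uid AS uid, max_by_day::ldap_server AS ldap_server,
    max_by_day::domain AS domain, max_by_day::day_of_week AS day_of_week,
    max_by_day::max AS max, min_by_day::min AS min,
    max_by_day::max - min_by_day::min AS time_on;
```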

Regarding hadoop - Pig heap on about 185 gigs (202260828 records), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21006769/
