
hadoop - Log analysis with Apache Pig

Reposted. Author: 行者123. Updated: 2023-12-02 21:51:29

I have a log file whose lines look like this:

in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] "GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839

The first column (in24.inetnebr.com) is the host, the second column (01/Aug/1995:00:00:01 -0400) is the timestamp, and the third column (GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0) is the downloaded page.
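For readers following along outside Pig, a line in this format can be split into those three fields with a regular expression. A minimal Python sketch (the regex and the field names `host`, `xdate`, `address` are my own, chosen to mirror the column names used in the answer below):

```python
import re
from datetime import datetime

# One Common Log Format line: host, two ignored fields, [timestamp], "request", status, size.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<xdate>[^\]]+)\] '
    r'"(?P<address>[^"]*)" (?P<status>\d+) (?P<size>\S+)$'
)

def parse_line(line):
    m = LOG_RE.match(line)
    if m is None:
        return None  # skip malformed lines
    d = m.groupdict()
    # '01/Aug/1995:00:00:01 -0400' -> timezone-aware datetime
    d["xdate"] = datetime.strptime(d["xdate"], "%d/%b/%Y:%H:%M:%S %z")
    return d

rec = parse_line('in24.inetnebr.com - - [01/Aug/1995:00:00:01 -0400] '
                 '"GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/1.0" 200 1839')
```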

How can I find, with Pig, the two most recently downloaded pages for each host?

Thanks very much for your help!

Best Answer

For the record, I have since solved this myself:

REGISTER piggybank.jar
DEFINE SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

raw = LOAD 'nasa' USING org.apache.hcatalog.pig.HCatLoader();

-- Cast the columns so that string functions can be used on them
rawCasted = FOREACH raw GENERATE (chararray)host AS host, (chararray)xdate AS xdate, (chararray)address AS address;

-- Cut the timezone off the date and keep only the columns we use
rawParsed = FOREACH rawCasted GENERATE host, SUBSTRING(xdate, 1, 20) AS xdate, address;

-- Drop rows whose date column is incomplete
rawFiltered = FILTER rawParsed BY xdate IS NOT NULL;

-- Cast the timestamp string to a datetime
analysisTable = FOREACH rawFiltered GENERATE host, ToDate(xdate, 'dd/MMM/yyyy:HH:mm:ss') AS xdate, address;

aTgrouped = GROUP analysisTable BY host;

resultsB = FOREACH aTgrouped {
    elems = ORDER analysisTable BY xdate DESC;
    two = LIMIT elems 2;    -- the last two pages

    fstB = ORDER two BY xdate DESC;
    fst = LIMIT fstB 1;     -- the last page

    sndB = ORDER two BY xdate ASC;
    snd = LIMIT sndB 1;     -- the previous page

    GENERATE FLATTEN(group), fst.address, snd.address;  -- put the pages together
};

DUMP resultsB;
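The nested GROUP/ORDER/LIMIT logic above can be sketched in plain Python, which may help readers who do not know Pig Latin. This is only an illustration of the same idea (the function name and the toy records are my own), not part of the original answer:

```python
from collections import defaultdict

def last_two_pages(records):
    """For each host, return (most recent page, previous page or None)."""
    by_host = defaultdict(list)
    for rec in records:                 # GROUP analysisTable BY host
        by_host[rec["host"]].append(rec)
    result = {}
    for host, recs in by_host.items():
        recs.sort(key=lambda r: r["xdate"], reverse=True)  # ORDER ... BY xdate DESC
        two = recs[:2]                                     # LIMIT elems 2
        fst = two[0]["address"]                            # the last page
        snd = two[1]["address"] if len(two) > 1 else None  # the previous page
        result[host] = (fst, snd)
    return result

# Toy data: xdate is any orderable timestamp (integers here for brevity).
records = [
    {"host": "a.com", "xdate": 1, "address": "/p1"},
    {"host": "a.com", "xdate": 3, "address": "/p3"},
    {"host": "a.com", "xdate": 2, "address": "/p2"},
    {"host": "b.com", "xdate": 5, "address": "/q1"},
]
# last_two_pages(records) -> {"a.com": ("/p3", "/p2"), "b.com": ("/q1", None)}
```

One difference worth noting: the Pig script emits a bag for each of `fst.address` and `snd.address`, while this sketch returns bare strings, and it also handles hosts with only a single request by returning None for the previous page.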

Regarding hadoop - Log analysis with Apache Pig, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/20482105/
