apache-pig - How do I enforce the correct data types in Apache Pig?

Reposted · Author: 行者123 · Updated: 2023-12-04 23:28:22

I am unable to SUM a bag of values because of a data type error.

I load a CSV file whose rows look like this:

6   574 false   10.1.72.23  2010-05-16 13:56:19 +0930   fbcdn.net   static.ak.fbcdn.net 304 text/css    1   /rsrc.php/zPTJC/hash/50l7x7eg.css   http    pwong

using the following:
logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    )
    AS (
        account_id: int,
        bytes: long,
        cached: chararray,
        ip: chararray,
        time: chararray,
        domain: chararray,
        host: chararray,
        status: chararray,
        mime_type: chararray,
        page_view: chararray,
        path: chararray,
        protocol: chararray,
        username: chararray
    );

All fields appear to load fine, and with the correct types, as shown by the `describe` command:
grunt> describe logs_base
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray}

Whenever I perform a SUM with:
bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

and then store or dump the contents, the MapReduce job fails with the following error:
org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87)
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79)
... 15 more

The line that catches my attention is:
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long

This leads me to believe that the extract function is not casting the bytes field to the required data type (long).

Is there a way to force the extract function to cast to the correct data types? How can I cast the value without having to perform a FOREACH over all the records? (The same problem happens when converting the time to a Unix timestamp and trying to find the MIN. I would definitely like to find a solution that does not require unnecessary projections.)

Any pointers would be appreciated. Thanks very much for your help.

Regards,
Jorge C.

P.S. I am running this in interactive mode on the Amazon Elastic MapReduce service.

Best Answer

Have you tried casting the data returned from the UDF? Applying a schema here does not perform any casts.

For example:

logs_base =
    FOREACH raw_logs
    GENERATE
        FLATTEN(
            (tuple(INT, LONG, CHARARRAY, ....)) EXTRACT(line, '^...')
        )
        AS (account_id: INT, ...);
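Applied to the full schema from the question, the cast might look like the sketch below. This is an illustration rather than a tested script: it assumes `EXTRACT` is the regex-extraction UDF the question already uses (which returns a tuple of chararrays), and that the field order matches the original schema.

```
-- Sketch: cast the tuple returned by EXTRACT before flattening, so that
-- account_id and bytes become numeric instead of staying chararrays.
logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        (tuple(int, long, chararray, chararray, chararray, chararray,
               chararray, chararray, chararray, chararray, chararray,
               chararray, chararray))
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    )
    AS (
        account_id: int,
        bytes: long,
        cached: chararray,
        ip: chararray,
        time: chararray,
        domain: chararray,
        host: chararray,
        status: chararray,
        mime_type: chararray,
        page_view: chararray,
        path: chararray,
        protocol: chararray,
        username: chararray
    );

-- With bytes now a real long, the SUM from the question should no longer
-- throw a ClassCastException:
bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);
```

With the cast in place, the tuple's fields are converted once at extraction time, so no extra FOREACH projection is needed; the same approach should cover the timestamp/MIN case by casting that field as well.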

Regarding "apache-pig - How do I enforce the correct data types in Apache Pig?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/8828839/
