gpt4 book ai didi

hadoop - pig 拉丁语 : Load multiple files from a date range (part of the directory structure)

转载 作者:可可西里 更新时间:2023-11-01 14:07:15 26 4
gpt4 key购买 nike

我有以下场景-

pig 版使用0.70

示例 HDFS 目录结构:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

正如您在上面列出的路径中看到的,其中一个目录名称是一个日期戳。

问题:我想加载日期范围为 20100810 到 20100813 的文件。

我可以将日期范围的“从”和“到”作为参数传递给 Pig 脚本,但我如何在 LOAD 语句中使用这些参数。我能够做到以下几点

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

以下适用于 hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

但是当我尝试在 pig 脚本中使用 LOAD 时,它失败了。如何使用传递给 Pig 脚本的参数从日期范围加载数据?

错误日志如下:

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
... 14 more



Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
at org.apache.pig.PigServer.openIterator(PigServer.java:521)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

我是否需要使用像 Python 这样的高级语言来捕获范围内的所有日期戳并将它们作为逗号分隔列表传递给 LOAD?

干杯

最佳答案

正如zjffdu所说,路径扩展是由shell完成的。解决问题的一种常见方法是简单地使用 Pig 参数(无论如何,这是使脚本更可重用的好方法):

外壳:

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

脚本. pig :

temp = LOAD '$input' USING SomeLoader() AS (...);

关于hadoop - pig 拉丁语 : Load multiple files from a date range (part of the directory structure),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/3515481/

26 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com