hadoop - Pig Latin: Load multiple files from a date range (part of the directory structure)

Reposted by: 可可西里. Updated: 2023-11-01 14:07:15

I have the following scenario.

Pig version used: 0.70

Sample HDFS directory structure:

/user/training/test/20100810/<data files>
/user/training/test/20100811/<data files>
/user/training/test/20100812/<data files>
/user/training/test/20100813/<data files>
/user/training/test/20100814/<data files>

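As you can see in the paths listed above, one of the directory names is a date stamp.

Problem: I want to load the files for the date range 20100810 to 20100813.

I can pass the "from" and "to" of the date range as parameters to the Pig script, but how do I use those parameters in the LOAD statement? I am able to do the following: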

temp = LOAD '/user/training/test/{20100810,20100811,20100812}' USING SomeLoader() AS (...);

The following works with hadoop:

hadoop fs -ls /user/training/test/{20100810..20100813}

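But it fails when I try the same pattern with LOAD inside the Pig script. How do I use the parameters passed to the Pig script to load data from a date range?

The error log follows: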

Backend error message during job submission
-------------------------------------------
org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
        at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:858)
        at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:875)
        at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:793)
        at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:752)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1062)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:752)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:726)
        at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
        at org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
        at java.lang.Thread.run(Thread.java:619)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input Pattern hdfs://<ServerName>.com/user/training/test/{20100810..20100813} matches 0 files
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:231)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextInputFormat.listStatus(PigTextInputFormat.java:36)
        at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:248)
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:258)
        ... 14 more



Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias test
        at org.apache.pig.PigServer.openIterator(PigServer.java:521)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: hdfs://<ServerName>.com/user/training/test/{20100810..20100813}
        at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)

Do I need to use a higher-level language like Python to capture all the date stamps in the range and pass them to LOAD as a comma-separated list?

Cheers
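
The helper the question has in mind could be sketched in Python roughly as follows. This is only an illustration: the function name `date_range_glob` and its parameters are hypothetical, and it assumes the directories are named with `YYYYMMDD` stamps as in the listing above. It produces the comma-separated brace form that the question reports as working in LOAD.

```python
from datetime import datetime, timedelta


def date_range_glob(base_dir, start, end):
    """Return a glob such as /base/{20100810,20100811} covering start..end inclusive.

    Hypothetical helper: start and end are YYYYMMDD strings.
    """
    first = datetime.strptime(start, "%Y%m%d").date()
    last = datetime.strptime(end, "%Y%m%d").date()
    if last < first:
        raise ValueError("end date precedes start date")

    # Walk the calendar day by day so month and year boundaries are handled.
    days = (last - first).days
    stamps = [(first + timedelta(days=n)).strftime("%Y%m%d") for n in range(days + 1)]

    base = base_dir.rstrip("/")
    if len(stamps) == 1:
        # A single day needs no brace list.
        return base + "/" + stamps[0]
    return base + "/{" + ",".join(stamps) + "}"


if __name__ == "__main__":
    print(date_range_glob("/user/training/test", "20100810", "20100813"))
```

The printed string could then be handed to the Pig script as a parameter value, quoted so that the shell leaves the braces alone.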

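Best answer

As zjffdu said, the path expansion is done by the shell. One common way to solve your problem is to simply use Pig parameters (which is a good way to make your script more reusable anyway):

Shell: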

pig -f script.pig -param input=/user/training/test/{20100810..20100812}

script.pig:

temp = LOAD '$input' USING SomeLoader() AS (...);
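
To make the "expansion is done by the shell" point concrete, here is a minimal Python sketch that emulates what bash does with a numeric sequence expression before the command ever runs. The function name `expand_numeric_range` is hypothetical, and the emulation is deliberately narrow: it handles a single plain numeric range and ignores zero padding, step values, and nesting.

```python
import re
from typing import List

# Matches the first numeric sequence expression, e.g. {20100810..20100812}.
_SEQ = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_range(word: str) -> List[str]:
    """Emulate bash's {start..end} expansion for one numeric range in a word."""
    match = _SEQ.search(word)
    if match is None:
        # Nothing to expand: the word is passed through unchanged.
        return [word]

    start, end = int(match.group(1)), int(match.group(2))
    step = 1 if end >= start else -1  # bash also counts downwards
    prefix, suffix = word[: match.start()], word[match.end():]

    # One output word per number, each keeping the surrounding text.
    return [f"{prefix}{n}{suffix}" for n in range(start, end + step, step)]


if __name__ == "__main__":
    for w in expand_numeric_range("input=/user/training/test/{20100810..20100812}"):
        print(w)
```

Two things follow from this behaviour. The shell hands the command several separate words rather than one pattern, so it is worth checking with `echo` what the command line looks like after expansion. The range is also purely numeric, so one that crosses a month boundary, such as `{20100831..20100901}`, would include names that are not valid dates.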

For this question (hadoop - Pig Latin: Load multiple files from a date range (part of the directory structure)), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/3515481/
