hadoop - How to read records that span multiple lines, and how to handle records broken across input splits

Reposted. Author: 可可西里. Updated: 2023-11-01 14:18:59

I have a log file that looks like this:

Begin ... 12-07-2008 02:00:05         ----> record1
incidentID: inc001
description: blah blah blah
owner: abc
status: resolved
end .... 13-07-2008 02:00:05
Begin ... 12-07-2008 03:00:05 ----> record2
incidentID: inc002
description: blah blah blahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblahblah
owner: abc
status: resolved
end .... 13-07-2008 03:00:05

I want to process this with MapReduce. I want to extract the incident ID, the status, and the time each incident took.

How do I handle these two records, given that they have variable record lengths? And what if an input split falls before the end of a record?

Best Answer

You need to write your own input format and record reader to make sure the file is split correctly around your record delimiters.

Basically, your record reader needs to seek to its split's byte offset, then scan forward (reading lines) until either:

  • it finds a Begin ... line
    • then it reads lines up to the next end ... line, and provides the lines between Begin and end as the input for the next record, or
  • it scans past the end of its split, or hits EOF
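The scan-forward loop above can be sketched independently of Hadoop. This is a minimal illustration of the boundary logic only (the class and method names are hypothetical, not part of any Hadoop API); a real RecordReader would also track byte offsets so it stops starting new records once past its split's end.

```java
import java.io.BufferedReader;
import java.io.IOException;

// Sketch of the record reader's scan-forward loop: skip lines until a
// "Begin" marker, then accumulate lines until the matching "end" marker.
public class MultiLineRecordScanner {

    // Returns the next Begin...end block as one string, or null at EOF.
    public static String nextRecord(BufferedReader reader) throws IOException {
        String line;
        // Scan forward to the next record start (this is what the reader
        // does after seeking to its split's byte offset: any partial record
        // at the front belongs to the previous split and is skipped).
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("Begin")) {
                break;
            }
        }
        if (line == null) {
            return null; // EOF before any record start
        }
        StringBuilder record = new StringBuilder(line);
        while ((line = reader.readLine()) != null) {
            record.append('\n').append(line);
            if (line.startsWith("end")) {
                return record.toString();
            }
        }
        return null; // truncated record at EOF: discard it
    }
}
```

Because each reader skips forward to the first `Begin` after its offset, and reads past its split's end to finish the record it started, every record is processed exactly once even when a split boundary lands mid-record.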

This is algorithmically similar to how Mahout's XmlInputFormat handles multi-line XML as input; in fact, you could modify that source directly to handle your case.

As noted in @irW's answer, NLineInputFormat is another option if your records have a fixed number of lines each, but it is genuinely inefficient for larger files, since it has to open and read the entire file in the input format's getSplits() method just to discover the line offsets.
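Once the record reader delivers each Begin...end block as a single value, extracting the incident ID, status, and elapsed time in the map function is plain string work. A sketch under the assumption that timestamps follow the sample's dd-MM-yyyy HH:mm:ss layout (the class and method names here are hypothetical):

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts incidentID, status, and elapsed time from one Begin...end record.
public class IncidentParser {
    // Timestamp layout as it appears in the sample log.
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("dd-MM-yyyy HH:mm:ss");
    // Locates the timestamp anywhere on the Begin/end line, so trailing
    // annotations after it are ignored.
    private static final Pattern TS =
            Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2}");

    public static String summarize(String record) {
        String id = null, status = null;
        LocalDateTime begin = null, end = null;
        for (String line : record.split("\n")) {
            Matcher m = TS.matcher(line);
            if (line.startsWith("Begin") && m.find()) {
                begin = LocalDateTime.parse(m.group(), FMT);
            } else if (line.startsWith("end") && m.find()) {
                end = LocalDateTime.parse(m.group(), FMT);
            } else if (line.startsWith("incidentID:")) {
                id = line.substring("incidentID:".length()).trim();
            } else if (line.startsWith("status:")) {
                status = line.substring("status:".length()).trim();
            }
        }
        long hours = Duration.between(begin, end).toHours();
        return id + "\t" + status + "\t" + hours + "h";
    }
}
```

In a real mapper you would emit the incident ID as the key and the status/duration as the value, rather than returning a formatted string.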

On "hadoop - How to read records that span multiple lines, and how to handle records broken across input splits", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/17713476/
