
hadoop - Hive: Is there a way to customize HiveInputFormat?

Reposted. Author: 可可西里. Updated: 2023-11-01 16:14:05

The scenario is as follows: there are three folders in HDFS, containing these files:

/root/20140901/part-0
/root/20140901/part-1
/root/20140901/part-2
/root/20140902/part-0
/root/20140902/part-1
/root/20140902/part-2
/root/20140903/part-0
/root/20140903/part-1
/root/20140903/part-2

After creating a Hive table with the statements below, I run the query `select * from hive_combine_test where rdm > 50000;`. It uses 9 mappers, the same as the number of files in HDFS.

CREATE EXTERNAL table hive_combine_test
(id string,
rdm string)
PARTITIONED BY (dateid string)
row format delimited fields terminated by '\t'
stored as textfile;

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140901')
location '/root/20140901';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140902')
location '/root/20140902';

ALTER TABLE hive_combine_test
ADD PARTITION (dateid='20140903')
location '/root/20140903';

But what I want is to combine the part-i files that share the same index into a single split, so that only three mappers run.
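As an aside: if the goal is simply to reduce the number of mappers (rather than to group by part index specifically), Hive ships a built-in combining input format that can be enabled per session. A sketch, assuming a size-based cap of 256 MB per combined split; CombineHiveInputFormat groups files by size and locality, not by part index, so it does not implement the exact grouping described above:

```sql
-- Built-in input format that merges small files into larger splits.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Cap each combined split at ~256 MB (value is in bytes; tune as needed).
set mapred.max.split.size=268435456;
select * from hive_combine_test where rdm > 50000;
```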

I tried subclassing org.apache.hadoop.hive.ql.io.HiveInputFormat to test whether a custom JudHiveInputFormat would work:

public class JudHiveInputFormat<K extends WritableComparable, V extends Writable>
        extends HiveInputFormat<K, V> {
    // Empty subclass for now - just checking that Hive will load a custom input format.
}
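The grouping being attempted (one split per part index, spanning the date folders) can be sketched in plain Java, independent of the Hadoop API; in a real input format the same logic would live inside getSplits(). The class and method names below are illustrative, not part of any Hive or Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PartGrouper {

    // Group HDFS paths by their trailing part index ("part-0", "part-1", ...),
    // mimicking the split grouping the question asks for: one mapper per group.
    public static Map<String, List<String>> groupByPartIndex(List<String> paths) {
        Map<String, List<String>> groups = new TreeMap<>();
        Pattern p = Pattern.compile("part-(\\d+)$");
        for (String path : paths) {
            Matcher m = p.matcher(path);
            if (m.find()) {
                groups.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(path);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // The nine files from the question: 3 date folders x 3 parts.
        List<String> paths = new ArrayList<>();
        for (String date : new String[]{"20140901", "20140902", "20140903"}) {
            for (int i = 0; i < 3; i++) {
                paths.add("/root/" + date + "/part-" + i);
            }
        }
        Map<String, List<String>> groups = groupByPartIndex(paths);
        System.out.println(groups.size() + " groups"); // prints: 3 groups
        System.out.println(groups.get("0"));
    }
}
```

With the nine files above this yields three groups, e.g. group "0" holds /root/20140901/part-0, /root/20140902/part-0, and /root/20140903/part-0, which is exactly the three-mapper layout being sought.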

But when I set it in Hive, the query throws an exception:

hive> add jar /my_path/jud_udf.jar;
hive> set hive.input.format=com.judking.hive.inputformat.JudHiveInputFormat;
hive> select * from hive_combine_test where rdm > 50000;

java.lang.RuntimeException: com.judking.hive.inputformat.JudCombineHiveInputFormat
at org.apache.hadoop.hive.ql.exec.mr.ExecDriver.execute(ExecDriver.java:290)
at org.apache.hadoop.hive.ql.exec.mr.MapRedTask.execute(MapRedTask.java:136)
at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1472)
at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1239)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1057)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:880)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:870)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

Can anyone give me a clue? Thanks a lot!

Best Answer

As far as I know, to use a custom INPUT/OUTPUT format in Hive you need to name that format in the CREATE TABLE statement, something like this:

CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT '<your input format class name>' OUTPUTFORMAT '<your output format class name>';

Since you only need a custom InputFormat, your CREATE TABLE statement would look like this:

CREATE TABLE (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'com.judking.hive.inputformat.JudHiveInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

Why mention the OUTPUT format class at all? Because once you override the INPUT format, Hive also needs an OUTPUT class, so here we tell Hive to use its default output format class.

Perhaps you can give it a try.

Hope this helps!

Regarding "hadoop - Hive: Is there a way to customize HiveInputFormat?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25760220/
