
hadoop - Reading a Hive table from MapReduce


I am currently writing a MapReduce program to find the differences between two Hive tables. My Hive tables are partitioned by one or more columns, so the folder names contain the values of the partition columns.

Is there a way to read a partitioned Hive table?

Can it be read in the mapper?

Best Answer

Since the underlying HDFS data of a partitioned Hive table will, by default, be organized as

table/root/folder/x=1/y=1
table/root/folder/x=1/y=2
table/root/folder/x=2/y=1
table/root/folder/x=2/y=2
...

you can build each of these input paths in your driver and add them through multiple calls to FileInputFormat.addInputPath(job, path), one call per folder path you build.
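For instance, a minimal sketch of that direct approach (the x=.../y=... folder names are the illustrative layout from above, not a real table):

// Minimal sketch: add each existing partition folder as an input path.
// The folder names are illustrative, matching the x/y layout above.
Job job = new Job(getConf());
FileInputFormat.addInputPath(job, new Path("table/root/folder/x=1/y=1"));
FileInputFormat.addInputPath(job, new Path("table/root/folder/x=1/y=2"));
FileInputFormat.addInputPath(job, new Path("table/root/folder/x=2/y=1"));
FileInputFormat.addInputPath(job, new Path("table/root/folder/x=2/y=2"));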

Sample code is pasted below. Note how the paths are added for MyMapper.class. In this example I am using the MultipleInputs API; the table is partitioned by "part" and "xdate".

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        conf.set("mapred.compress.map.output", "true");
        conf.set("mapred.output.compression.type", "BLOCK");

        Job job = new Job(conf);
        // Set up various job parameters.
        job.setJarByClass(MyDriver.class);
        job.setJobName(conf.get("job.name"));
        // One fixed partition path goes to OneMapper...
        MultipleInputs.addInputPath(job,
                new Path(conf.get("root.folder") + "/xdate=" + conf.get("start.date")),
                TextInputFormat.class, OneMapper.class);
        // ...and one call per existing partition folder goes to MyMapper.
        for (Path path : getPathList(job, conf)) {
            System.out.println("path: " + path.toString());
            MultipleInputs.addInputPath(job, path,
                    Class.forName(conf.get("input.format"))
                            .asSubclass(FileInputFormat.class)
                            .asSubclass(InputFormat.class),
                    MyMapper.class);
        }
        ...
        ...
        return job.waitForCompletion(true) ? 0 : -2;
    }

    // Builds the list of partition paths between start.date and end.date for
    // every value in part.list, skipping folders that do not exist on HDFS.
    private static ArrayList<Path> getPathList(Job job, Configuration conf) {
        String rootdir = conf.get("input.path.rootfolder");
        String partlist = conf.get("part.list");
        String startdate_s = conf.get("start.date");
        String enxdate_s = conf.get("end.date");
        ArrayList<Path> pathlist = new ArrayList<Path>();
        String[] partlist_split = partlist.split(",");
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        Date startdate_d = null;
        Date enxdate_d = null;
        Path path = null;
        try {
            startdate_d = sdf.parse(startdate_s);
            enxdate_d = sdf.parse(enxdate_s);
            GregorianCalendar gcal = new GregorianCalendar();
            gcal.setTime(startdate_d);
            FileSystem fs = FileSystem.get(conf);
            Date d = null;
            for (String part : partlist_split) {
                // Restart at start.date for each partition value.
                gcal.setTime(startdate_d);
                do {
                    d = gcal.getTime();
                    path = new Path(rootdir + "/part=" + part + "/xdate="
                            + sdf.format(d));
                    // Only add partition folders that actually exist.
                    if (fs.exists(path)) {
                        pathlist.add(path);
                    }
                    gcal.add(Calendar.DAY_OF_YEAR, 1);
                } while (d.before(enxdate_d));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return pathlist;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyDriver(), args);
        System.exit(res);
    }
}
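The driver pulls everything from -D properties parsed by ToolRunner, so a run might look like this (the jar name and values are illustrative):

hadoop jar mydiff.jar MyDriver \
    -D job.name=tablediff \
    -D root.folder=/user/hive/warehouse/mytable \
    -D input.path.rootfolder=/user/hive/warehouse/mytable \
    -D part.list=a,b \
    -D start.date=2013-04-01 -D end.date=2013-04-24 \
    -D input.format=org.apache.hadoop.mapreduce.lib.input.TextInputFormat

As for reading the partition values inside the mapper: since they exist only in the folder names, one option is to parse them out of the input split's path in setup(). Below is a minimal sketch under that assumption (the mapper body is hypothetical; note that MultipleInputs wraps the split in a TaggedInputSplit, so the FileSplit cast only works when paths were added through plain FileInputFormat):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> partitionValues = new HashMap<String, String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // FileSplit exposes the path of the file this map task is reading.
        Path filePath = ((FileSplit) context.getInputSplit()).getPath();
        // Walk up the directory tree collecting key=value folder names.
        for (Path p = filePath.getParent(); p != null; p = p.getParent()) {
            String name = p.getName();
            int eq = name.indexOf('=');
            if (eq > 0) {
                partitionValues.put(name.substring(0, eq), name.substring(eq + 1));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // partitionValues now holds, e.g., {part=..., xdate=...} for this split;
        // compare the row in 'value' against the other table here.
    }
}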

Regarding "hadoop - Reading a Hive table from MapReduce", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/16191477/
