
java - Finding the Top-K records in a data set

Reposted. Author: 可可西里. Last updated: 2023-11-01 16:27:21

To learn Hadoop, I am working through the unsolved programming exercises in the book "Hadoop in Action".

Sample of the data set:

3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
3070806,1963,1096,,"US","PA",,1,,2,6,63,,0,,,,,,,,,
3070807,1963,1096,,"US","OH",,1,,623,3,39,,3,,0.4444,,,,,,,
3070808,1963,1096,,"US","IA",,1,,623,3,39,,4,,0.375,,,,,,,
3070809,1963,1096,,"US","AZ",,1,,4,6,65,,0,,,,,,,,,
3070810,1963,1096,,"US","IL",,1,,4,6,65,,3,,0.4444,,,,,,,

Map function

    public static class MapClass extends MapReduceBase implements Mapper<Text, Text, IntWritable, Text> {
        private int maxClaimCount = 0;
        private Text record = new Text();

        public void map(Text key, Text value, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            String claim = value.toString().split(",")[7];
            //if (!claim.isEmpty() && claim.matches("\\d")) {
            if (!claim.isEmpty()) {
                int claimCount = Integer.parseInt(claim);
                if (claimCount > maxClaimCount) {
                    maxClaimCount = claimCount;
                    record = value;
                    output.collect(new IntWritable(claimCount), value);
                }
                // output.collect(new IntWritable(claimCount), value);
            }
        }

    }

Reduce function

    public static class Reduce extends MapReduceBase implements Reducer<IntWritable, Text, IntWritable, Text> {

        public void reduce(IntWritable key, Iterator<Text> values, OutputCollector<IntWritable, Text> output, Reporter reporter) throws IOException {
            output.collect(key, values.next());
        }
    }

Command to run:

hadoop jar ~/Desktop/wc.jar com/hadoop/patent/TopKRecords -Dmapred.map.tasks=7 ~/input  ~/output

Requirement:
- Find the top K records (say, 7) in the data set, ranked by the value of the ninth column

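Problem:
- Since only the 7 highest records are needed, I ran 7 map tasks and made sure each one keeps the record with the highest count, tracked in maxClaimCount and record
- I don't know how to collect only that maximum record, so that each map emits just one output

How can I do this?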

Best answer

This is an updated answer. None of the comments apply to it, because they were based on the original (incorrect) answer.


The mapper should simply emit

    output.collect(new IntWritable(claimCount), value);

without any comparison. The results will be sorted by claim count and passed to the reducer.
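
Before it can emit anything, the mapper still has to pull the count out of each row. The following is a minimal plain-Java sketch of that step, with no Hadoop dependency. The class name ClaimFieldSketch and the method parseCount are hypothetical names introduced here, and the column index is left as a parameter rather than assumed. It relies on two facts about the Java standard library: String.split(",") drops trailing empty fields unless a negative limit is passed, and the pattern "\\d" matches exactly one digit, so "\\d+" is used for multi-digit counts.

```java
import java.util.OptionalInt;

// Hypothetical helper (not part of Hadoop): extracts an integer count from one CSV row.
class ClaimFieldSketch {

    // Returns the non-negative integer found in the given 0-based column,
    // or an empty OptionalInt if the column is missing, blank, or not a plain integer.
    static OptionalInt parseCount(String line, int columnIndex) {
        // A limit of -1 keeps trailing empty fields, so short-looking rows keep their shape.
        String[] fields = line.split(",", -1);
        if (columnIndex < 0 || columnIndex >= fields.length) {
            return OptionalInt.empty();
        }
        String field = fields[columnIndex].trim();
        // "\\d+" accepts one or more digits; "\\d" alone would reject values such as 269.
        if (!field.matches("\\d+")) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(field));
        } catch (NumberFormatException e) {
            // All digits, but too large to fit in an int.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        String row = "3070803,1963,1096,,\"US\",\"IL\",,1,,2,6,63,,9,,0.3704,,,,,,,";
        System.out.println(parseCount(row, 7));
        System.out.println(parseCount(row, 8));
    }
}
```

In a mapper, a row for which this returns an empty result would simply be skipped, and every other row would be emitted with its count as the key.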

In the Reducer, use a priority queue to keep the top 7 results.
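
The priority-queue idea can be sketched in plain Java as a bounded min-heap: every incoming record is added, and whenever the queue grows past K the smallest entry is evicted, so only the K largest remain. The class TopKSketch and its methods are hypothetical names for illustration, not Hadoop API and not necessarily what the answer's author had in mind. Wiring this into a Reducer is left out; it would presumably need the queue held in a field across reduce() calls and a single reduce task, so that one reducer sees every key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper (not part of Hadoop): remembers only the k largest entries offered.
class TopKSketch {

    // One (claimCount, record) pair; field names are illustrative.
    static final class Entry {
        final int claimCount;
        final String record;

        Entry(int claimCount, String record) {
            this.claimCount = claimCount;
            this.record = record;
        }
    }

    private final int k;
    // Min-heap ordered by claimCount: the head is the smallest of the current top k.
    private final PriorityQueue<Entry> heap =
            new PriorityQueue<>(Comparator.comparingInt((Entry e) -> e.claimCount));

    TopKSketch(int k) {
        this.k = k;
    }

    // Adds a record, then evicts the smallest entry if more than k are held.
    void offer(int claimCount, String record) {
        heap.add(new Entry(claimCount, record));
        if (heap.size() > k) {
            heap.poll();
        }
    }

    // Returns the retained entries, largest claimCount first.
    List<Entry> descending() {
        List<Entry> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingInt((Entry e) -> e.claimCount).reversed());
        return out;
    }

    public static void main(String[] args) {
        TopKSketch top = new TopKSketch(7);
        int[] counts = {1, 0, 9, 3, 1, 0, 3, 4, 0, 3};
        for (int i = 0; i < counts.length; i++) {
            top.offer(counts[i], "record-" + i);
        }
        for (Entry e : top.descending()) {
            System.out.println(e.claimCount + "\t" + e.record);
        }
    }
}
```

Each offer costs O(log K), and memory stays bounded by K regardless of how many records pass through, which is why a heap is preferred here over sorting the full input.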

Regarding "java - Finding the Top-K records in a data set", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9202395/
