
java - Finding the average per department with Spark groupBy in Java 1.8

Reposted. Author: 行者123. Updated: 2023-11-30 06:52:27

I have the dataset below, where the first column is the department and the second column is the salary. I want to compute the average salary per department.

IT  2000000
HR 2000000
IT 1950000
HR 2200000
Admin 1900000
IT 1900000
IT 2200000

I ran the following:

JavaPairRDD<String, Iterable<Long>> rddY = employees.groupByKey();
System.out.println("<=========================RDDY collect==================>" + rddY.collect());

and got this output:

<=========================RDDY collect==================>[(IT,[2000000, 1950000, 1900000, 2200000]), (HR,[2000000, 2200000]), (Admin,[1900000])]

What I need:

  1. I want to compute both the overall average and the per-department average using Spark RDDs.

  2. How do I compute the averages using Spark's groupBy function?
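For reference, here is the expected result computed in plain Java 8 streams, without Spark (the class and method names are my own, for illustration only): `Collectors.groupingBy` with `Collectors.averagingLong` gives the per-department averages, and `mapToLong(...).average()` gives the overall average.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DeptAverage {

    // sample data from the question: {department, salary}
    static final List<String[]> ROWS = Arrays.asList(
            new String[]{"IT", "2000000"},
            new String[]{"HR", "2000000"},
            new String[]{"IT", "1950000"},
            new String[]{"HR", "2200000"},
            new String[]{"Admin", "1900000"},
            new String[]{"IT", "1900000"},
            new String[]{"IT", "2200000"});

    // average salary per department
    static Map<String, Double> deptAverages(List<String[]> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(r -> r[0],
                        Collectors.averagingLong(r -> Long.parseLong(r[1]))));
    }

    // overall average salary across all departments
    static double overallAverage(List<String[]> rows) {
        return rows.stream()
                .mapToLong(r -> Long.parseLong(r[1]))
                .average()
                .orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(deptAverages(ROWS));
        System.out.println(overallAverage(ROWS));
    }
}
```

For this data, the per-department averages are IT = 2012500.0, HR = 2100000.0, Admin = 1900000.0; the Spark solution below should reproduce the same numbers.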

Best Answer

Below is code that computes the average per key using a Spark JavaPairRDD. Hope this helps.

import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SparkAverageCalculation {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("Average Calculation").setMaster("local[2]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // input list
        List<Tuple2<String, Integer>> inputList = new ArrayList<Tuple2<String, Integer>>();
        inputList.add(new Tuple2<String, Integer>("a1", 30));
        inputList.add(new Tuple2<String, Integer>("b1", 30));
        inputList.add(new Tuple2<String, Integer>("a1", 40));
        inputList.add(new Tuple2<String, Integer>("a1", 20));
        inputList.add(new Tuple2<String, Integer>("b1", 50));
        // parallelizePairs
        JavaPairRDD<String, Integer> pairRDD = sc.parallelizePairs(inputList);
        // pair each value with a count of 1: (key, (value, 1))
        JavaPairRDD<String, Tuple2<Integer, Integer>> valueCount = pairRDD.mapValues(value -> new Tuple2<Integer, Integer>(value, 1));
        // sum the values and the counts per key with reduceByKey
        JavaPairRDD<String, Tuple2<Integer, Integer>> reducedCount = valueCount.reduceByKey((tuple1, tuple2) -> new Tuple2<Integer, Integer>(tuple1._1 + tuple2._1, tuple1._2 + tuple2._2));
        // calculate average = sum / count
        JavaPairRDD<String, Integer> averagePair = reducedCount.mapToPair(getAverageByKey);
        // print average by key
        averagePair.foreach(data -> {
            System.out.println("Key=" + data._1() + " Average=" + data._2());
        });
        // stop sc
        sc.stop();
        sc.close();
    }

    private static PairFunction<Tuple2<String, Tuple2<Integer, Integer>>, String, Integer> getAverageByKey = (tuple) -> {
        Tuple2<Integer, Integer> val = tuple._2;
        int total = val._1;
        int count = val._2;
        // note: integer division truncates the average
        Tuple2<String, Integer> averagePair = new Tuple2<String, Integer>(tuple._1, total / count);
        return averagePair;
    };
}
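To answer the second question directly: once `groupByKey` has produced an `Iterable` of salaries per department, a `mapValues` with a small averaging helper is enough. The helper below is plain Java; the Spark calls are shown only in comments (the class and method names are mine, and the calls would need a running SparkContext, so this is a sketch rather than a tested Spark program):

```java
import java.util.Arrays;

public class GroupByAverage {

    // Averages the values produced by groupByKey for one department.
    static double averageOf(Iterable<Long> salaries) {
        long sum = 0;
        long count = 0;
        for (long s : salaries) {
            sum += s;
            count++;
        }
        return count == 0 ? 0 : (double) sum / count;
    }

    // With the rddY from the question, the per-department averages would be:
    //   JavaPairRDD<String, Double> avgByDept = rddY.mapValues(GroupByAverage::averageOf);
    // and the overall average (on the original employees pair RDD) something like:
    //   double overall = employees.mapToDouble(t -> t._2()).mean();
    // (mapToDouble yields a JavaDoubleRDD, which has a mean() method.)

    public static void main(String[] args) {
        // the IT salaries from the groupByKey output in the question
        System.out.println(averageOf(Arrays.asList(2000000L, 1950000L, 1900000L, 2200000L)));
    }
}
```

Note that `groupByKey` shuffles all values per key, so for large data the `reduceByKey` approach in the accepted answer is generally preferred.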

Regarding "java - Finding the average per department with Spark groupBy in Java 1.8", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/38847188/
