java - How to compute standard deviation and mean on a Java Spark RDD?


I have a JavaRDD that looks like this:

[
[A,8]
[B,3]
[C,5]
[A,2]
[B,8]
...
...
]

I want my result to be the mean for each key:

[
[A,5]
[B,5.5]
[C,5]
]

How can I do this using only Java RDDs? P.S.: I want to avoid the groupBy operation, so I am not using DataFrames.
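If only the per-key mean is needed, a common RDD-only pattern is to carry a (sum, count) pair through reduceByKey and divide at the end. A minimal sketch of that idea (the class name MeanByKey and the inlined sample data are illustrative, not from the original post):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class MeanByKey {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("MeanByKey").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
                new Tuple2<>("A", 8), new Tuple2<>("B", 3), new Tuple2<>("C", 5),
                new Tuple2<>("A", 2), new Tuple2<>("B", 8)));

        // Carry (sum, count) per key, then divide; no grouping involved.
        JavaPairRDD<String, Double> means = pairs
                .mapValues(v -> new Tuple2<>(v, 1))
                .reduceByKey((a, b) -> new Tuple2<>(a._1() + b._1(), a._2() + b._2()))
                .mapValues(t -> (double) t._1() / t._2());

        means.collect().forEach(System.out::println); // e.g. (A,5.0), (B,5.5), (C,5.0)

        sc.stop();
    }
}

The accepted answer below goes further and computes the standard deviation as well.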

Best Answer

Here you go:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.util.StatCounter;
import scala.Tuple2;
import scala.Tuple3;

import java.util.Arrays;
import java.util.List;

public class AggregateByKeyStatCounter {

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("AggregateByKeyStatCounter").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<Tuple2<String, Integer>> myList = Arrays.asList(new Tuple2<>("A", 8), new Tuple2<>("B", 3), new Tuple2<>("C", 5),
                new Tuple2<>("A", 2), new Tuple2<>("B", 8));

        JavaRDD<Tuple2<String, Integer>> data = sc.parallelize(myList);
        JavaPairRDD<String, Integer> pairs = JavaPairRDD.fromJavaRDD(data);

        /* aggregateByKey builds a StatCounter per key, so even more
           statistics than stdev and mean are available if needed */
        JavaRDD<Tuple3<String, Double, Double>> output = pairs
                .aggregateByKey(
                        new StatCounter(),
                        StatCounter::merge,  // fold one value into the partition's StatCounter
                        StatCounter::merge)  // combine StatCounters across partitions
                .map(x -> new Tuple3<String, Double, Double>(x._1(), x._2().stdev(), x._2().mean()));

        output.collect().forEach(System.out::println);
    }

}
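With the sample data above, collect() should print each key's population standard deviation and mean, something like the following (row order is not guaranteed for an RDD):

(A,3.0,5.0)
(B,2.5,5.5)
(C,0.0,5.0)

Because each key's aggregate is a full StatCounter, further statistics such as count(), min(), max(), and variance() are available from the same single pass.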

Regarding "java - How to compute standard deviation and mean on a Java Spark RDD?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/37509799/
