
java - NotSerializableException when sorting in Spark


I am trying to write a simple streaming Spark job that takes a list of messages (in JSON format), each belonging to a user, counts the messages per user, and prints the top ten users.

However, as soon as I define a Comparator<Tuple2<String, Long>> to sort the reduced counts, the whole thing breaks down with a java.io.NotSerializableException.

My Maven dependency for Spark:

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.9.3</artifactId>
    <version>0.8.0-incubating</version>
</dependency>

The Java code I am using:

public static void main(String[] args) {

    JavaSparkContext sc = new JavaSparkContext("local", "spark");

    JavaRDD<String> lines = sc.textFile("stream.sample.txt").cache();

    JavaPairRDD<String, Long> words = lines
            .map(new Function<String, JsonElement>() {
                // parse line into JSON
                @Override
                public JsonElement call(String t) throws Exception {
                    return (new JsonParser()).parse(t);
                }
            }).map(new Function<JsonElement, String>() {
                // read User ID from JSON
                @Override
                public String call(JsonElement json) throws Exception {
                    return json.getAsJsonObject().get("userId").toString();
                }
            }).map(new PairFunction<String, String, Long>() {
                // map each line to a (userId, 1) pair
                @Override
                public Tuple2<String, Long> call(String arg0) throws Exception {
                    return new Tuple2<String, Long>(arg0, 1L);
                }
            }).reduceByKey(new Function2<Long, Long, Long>() {
                // count messages for every user
                @Override
                public Long call(Long arg0, Long arg1) throws Exception {
                    return arg0 + arg1;
                }
            });

    // sort the result in descending order and take the 10 users with the highest message count
    // This causes the exception
    List<Tuple2<String, Long>> sorted = words.takeOrdered(10, new Comparator<Tuple2<String, Long>>() {
        @Override
        public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
            return -1 * o1._2().compareTo(o2._2());
        }
    });

    // print result
    for (Tuple2<String, Long> tuple : sorted) {
        System.out.println(tuple._1() + ": " + tuple._2());
    }
}

The resulting stack trace:

java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: net.imagini.spark.test.App$5
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:670)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:668)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:668)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:376)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

I looked through the Spark API documentation but could not find anything that would point me in the right direction. Am I doing something wrong, or is this a bug in Spark? Any help would be greatly appreciated.

Best Answer

As @vanco.anton mentioned, you can use Java 8 functional interfaces to do something like the following:

// A Comparator that is also Serializable, so Spark can ship it to the executors.
public interface SerializableComparator<T> extends Comparator<T>, Serializable {

    // Identity helper: gives a lambda a target type that is both
    // Comparator and Serializable.
    static <T> SerializableComparator<T> serialize(SerializableComparator<T> comparator) {
        return comparator;
    }

}

Then in your code:

import static SerializableComparator.serialize; // adjust to the fully qualified name of your interface
...
rdd.top(10, serialize((a, b) -> a._2.compareTo(b._2))); // top() already returns the largest elements, so no negation is needed
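
Since the question targets Spark 0.8 and pre-Java-8 code, the same fix also works without lambdas. Below is a minimal sketch, assuming a hypothetical class name CountComparator (not from the original post): any named class that implements both Comparator and Serializable avoids the exception, because Spark can then serialize it and ship it to the executors.

import java.io.Serializable;
import java.util.Comparator;

import scala.Tuple2;

// Hypothetical helper class. Implementing Serializable alongside Comparator
// is what avoids the java.io.NotSerializableException from the question.
public class CountComparator implements Comparator<Tuple2<String, Long>>, Serializable {

    @Override
    public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
        // descending by message count, as in the original takeOrdered call
        return -1 * o1._2().compareTo(o2._2());
    }
}

The original call then becomes words.takeOrdered(10, new CountComparator()) with no other changes.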

Regarding java - NotSerializableException when sorting in Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/19433135/
