java - Spark, NegativeArraySizeException when reading sequence files


I am using Spark 2.3.

I have this code snippet, which reads sequence files under 'hdfspath' (there are about 20 files under that path, each roughly 60 MB):

SparkSession spark = ...;
JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());
JavaPairRDD<BytesWritable, BytesWritable> temp = jsc.sequenceFile(hdfspath, BytesWritable.class, BytesWritable.class);
temp.take(1);

It gives me this error:

19/04/03 14:50:18 INFO CodecPool: Got brand-new decompressor [.gz]
19/04/03 14:50:18 INFO CodecPool: Got brand-new decompressor [.gz]
19/04/03 14:50:18 INFO CodecPool: Got brand-new decompressor [.gz]
19/04/03 14:50:18 INFO CodecPool: Got brand-new decompressor [.gz]
19/04/03 14:50:18 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NegativeArraySizeException
at org.apache.hadoop.io.BytesWritable.setCapacity(BytesWritable.java:144)
at org.apache.hadoop.io.BytesWritable.setSize(BytesWritable.java:123)
at org.apache.hadoop.io.BytesWritable.readFields(BytesWritable.java:179)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
at org.apache.hadoop.io.SequenceFile$Reader.deserializeKey(SequenceFile.java:2606)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2597)
at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:82)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:277)

The HDFS files I am trying to read are the output of an old MapReduce job that was configured like this:

job.setOutputKeyClass(BytesWritable.class);
job.setOutputValueClass(BytesWritable.class);
job.setOutputFormatClass(SequenceFileAsBinaryOutputFormat.class);
SequenceFileAsBinaryOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);

I looked into org.apache.hadoop.io.BytesWritable.setSize(...), which calls the setCapacity(...) method:

public void setSize(int size) {
  if (size > getCapacity()) {
    setCapacity(size * 3 / 2);
  }
  this.size = size;
}

Somehow the size parameter comes in as 808464432, so size * 3 overflows int, which ultimately leads to the NegativeArraySizeException.
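
For what it's worth, 808464432 is 0x30303030 (the ASCII bytes "0000"), which suggests that raw payload bytes are being interpreted as a length field. Here is a minimal sketch of just the overflow arithmetic (the class name is made up for illustration):

public class OverflowDemo {
    public static void main(String[] args) {
        int size = 808464432;            // 0x30303030, i.e. the ASCII bytes "0000"
        int capacity = size * 3 / 2;     // size * 3 = 2425393296 wraps to -1869574000, then / 2 gives -934787000
        System.out.println(capacity);    // prints -934787000
        byte[] buf = new byte[capacity]; // throws NegativeArraySizeException, matching the stack trace above
    }
}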

Can anyone help explain why this happens and how to fix it?

Best Answer

Figured it out. Use JavaSparkContext#newAPIHadoopFile instead of JavaSparkContext#sequenceFile.
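
A minimal sketch of what that change could look like, assuming files written by SequenceFileAsBinaryOutputFormat are read back with the matching new-API SequenceFileAsBinaryInputFormat (the original answer does not show the exact input format used); spark and hdfspath are the same as in the question:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileAsBinaryInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

// SequenceFileAsBinaryInputFormat hands back each record's raw key/value bytes wrapped
// in BytesWritable, so the payload is not re-parsed through BytesWritable.readFields.
JavaPairRDD<BytesWritable, BytesWritable> temp = jsc.newAPIHadoopFile(
        hdfspath,
        SequenceFileAsBinaryInputFormat.class,
        BytesWritable.class,
        BytesWritable.class,
        new Configuration());

temp.take(1);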

Regarding java - Spark, NegativeArraySizeException when reading sequence files: a similar question was found on Stack Overflow: https://stackoverflow.com/questions/55489244/
