
apache-spark - Object not serializable error on org.apache.avro.generic.GenericData$Record


I am using the following code to create an RDD over files that I imported into Hive from MySQL with Sqoop:

// Imports assumed for this snippet; on newer parquet-mr releases the Parquet
// classes live under org.apache.parquet.* instead of parquet.*
import org.apache.avro.generic.GenericRecord
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.spark.rdd.RDD
import parquet.avro.AvroReadSupport
import parquet.hadoop.ParquetInputFormat

def rddFromParquetHdfsFile(path: String): RDD[GenericRecord] = {
  val job = new Job()
  FileInputFormat.setInputPaths(job, path)
  ParquetInputFormat.setReadSupportClass(job,
    classOf[AvroReadSupport[GenericRecord]])
  sc.newAPIHadoopRDD(job.getConfiguration,
    classOf[ParquetInputFormat[GenericRecord]],
    classOf[Void],
    classOf[GenericRecord]).map(x => x._2)
}
val warehouse = "hdfs://quickstart/user/hive/warehouse/"
val order_items = rddFromParquetHdfsFile(warehouse + "order_items");
val products = rddFromParquetHdfsFile(warehouse + "products");

I now try to look at the first 5 products:
products.take(5)

I end up with the following error:
org.apache.spark.SparkException: 
Job aborted due to stage failure: Task 0.0 in stage 0.0 (TID 0) had a
not serializable result: org.apache.avro.generic.GenericData$Record
Serialization stack:
- object not serializable (class:
org.apache.avro.generic.GenericData$Record, value: {"product_id": 1,
"product_category_id": 2, "product_name": "Quest Q64 10 FT. x 10 FT. Slant Leg Instant U", "product_description": "", "product_price": 59.98, "product_image": "http://images.acmesports.sports /Quest+Q64+10+FT.+x+10+FT.+Slant+Leg+Instant+Up+Canopy"})
- element of array (index: 0)
- array (class [Lorg.apache.avro.generic.GenericRecord;, size 4)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)

Any suggestions on how to resolve this?

Best Answer

Try this with your Spark conf:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") 
conf.registerKryoClasses(Array(classOf[org.apache.avro.generic.GenericData.Record]))
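
For context, org.apache.avro.generic.GenericData$Record does not implement java.io.Serializable, and products.take(5) has to serialize result records on the executors and ship them back to the driver, which is why the default Java serializer fails. Below is a minimal sketch of where these settings go, assuming the conf is built and the Kryo registration is done before the SparkContext is created (the app name is just a placeholder; in spark-shell, where sc already exists, pass the equivalent settings via --conf instead):

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch (assumed app name); the serializer settings must be on the
// SparkConf before the SparkContext is constructed.
val conf = new SparkConf()
  .setAppName("parquet-avro-take")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.registerKryoClasses(Array(classOf[org.apache.avro.generic.GenericData.Record]))

val sc = new SparkContext(conf)

// With Kryo as the serializer, results containing GenericData.Record
// can be brought back to the driver, e.g. products.take(5)

Registering GenericData.Record is not strictly required once Kryo is the serializer (unless spark.kryo.registrationRequired is enabled), but it keeps the serialized form compact.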

Regarding apache-spark - Object not serializable error on org.apache.avro.generic.GenericData$Record, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/35911617/
