
java - Spark ML - Cannot load model using MatrixFactorizationModel


I am trying to implement a recommendation system using Spark collaborative filtering.

First I train the model and save it to disk:

MatrixFactorizationModel model = trainModel(inputDataRdd);  
model.save(jsc.sc(), "/op/tc/model/");
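
(trainModel is not shown in the question; a minimal sketch of what it could look like with MLlib's ALS is below. The rank, iteration count, and regularization values are placeholder assumptions, not taken from the post.)

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.recommendation.ALS;
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel;
import org.apache.spark.mllib.recommendation.Rating;

static MatrixFactorizationModel trainModel(JavaRDD<Rating> inputDataRdd) {
    int rank = 10;        // number of latent factors (assumed value)
    int iterations = 10;  // ALS iterations (assumed value)
    double lambda = 0.01; // regularization strength (assumed value)
    // ALS.train expects a Scala RDD, hence the .rdd() conversion
    return ALS.train(inputDataRdd.rdd(), rank, iterations, lambda);
}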

When I load the model in a separate process, the program fails with the exception below.
Code:

static JavaSparkContext jsc;
private static Options options;
static {
    SparkConf conf = new SparkConf().setAppName("TC recommender application");
    conf.set("spark.driver.allowMultipleContexts", "true");
    jsc = new JavaSparkContext(conf);
}

MatrixFactorizationModel model = MatrixFactorizationModel.load(jsc.sc(),
        "/op/tc/model/");

Exception:

Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
    at org.apache.spark.rdd.RDD$$anonfun$aggregate$1.apply(RDD.scala:1114)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1107)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.countApproxDistinctUserProduct(MatrixFactorizationModel.scala:96)
    at org.apache.spark.mllib.recommendation.MatrixFactorizationModel.predict(MatrixFactorizationModel.scala:126)
    at com.aexp.cxp.recommendation.ProductRecommendationIndividual.main(ProductRecommendationIndividual.java:62)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:742)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Is there any configuration I need to set in order to load the model? Any suggestion would be a great help.

Best Answer

In Spark, as in any other distributed computing framework, it is really important to understand where your code actually runs when you are trying to debug it. It is just as important to have access to the various logs. On YARN, for example, you would have:

  • the master logs (if you log them yourself)
  • the aggregated slave logs (thanks YARN, a useful feature!)
  • the YARN node manager (which will tell you, for example, why a container was killed, etc.)
  • etc.

Digging into a Spark problem can be very time-consuming if you do not look in the right place from the start. Now, more specifically on this question: you have a clear stack trace, which is not always the case, so you should use it to your advantage.

The top of the stack trace is:

Exception in thread "main" java.io.IOException: Not a file: maprfs:/op/tc/model/data
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:324)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at

As you can see, the Spark job failed while executing a map operation. Who executes maps? The slaves. So you have to make sure your file is available on all the slaves, not only on the master.
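
As an illustration of that check (this sketch is not part of the original answer, and assumes the Hadoop client classes are on the classpath), the driver can ask the cluster filesystem whether the directory the loader needs is actually there:

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Query the filesystem Spark is configured against (maprfs in this case)
// for the model's data directory.
FileSystem fs = FileSystem.get(jsc.hadoopConfiguration());
Path data = new Path("/op/tc/model/data");
System.out.println("exists: " + fs.exists(data));
for (FileStatus s : fs.listStatus(data)) {
    System.out.println(s.getPath() + " isDirectory=" + s.isDirectory());
}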

More generally, you always need to keep a clear distinction in your head between the code that is written for the master and the code that is written for the slaves. That is what will help you spot this kind of interaction, as well as references to non-serializable objects and similar common mistakes.
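
To make that last point concrete, here is a hypothetical, self-contained sketch (the Cache class and the data are invented for illustration, not taken from the question) of the non-serializable-reference mistake described above:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SerializationPitfall {
    // A helper that does NOT implement java.io.Serializable.
    static class Cache {
        String lookup(int id) { return "item-" + id; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("serialization-pitfall").setMaster("local[*]");
        JavaSparkContext jsc = new JavaSparkContext(conf);

        Cache cache = new Cache(); // created on the master (driver)

        JavaRDD<Integer> ids = jsc.parallelize(Arrays.asList(1, 2, 3));
        // The lambda below is shipped to the slaves; because it captures
        // `cache`, Spark must serialize it, so the job aborts with
        // "Task not serializable" caused by a NotSerializableException.
        JavaRDD<String> names = ids.map(id -> cache.lookup(id));
        names.collect();

        jsc.stop();
    }
}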

For java - Spark ML - Cannot load model using MatrixFactorizationModel, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/39006742/
