
java - Converting from DataFrame to JavaPairRDD<Long, Vector>


I am trying to implement the LDA algorithm using Apache Spark's Java API. The method LDA().run() takes a JavaPairRDD<Long, Vector> documents parameter. In Scala I can build the required RDD[(Long, Vector)] like this:

// pair each document id with its term-count vector
val countVectors = cvModel.transform(filteredTokens)
  .select("docId", "features")
  .map { case Row(docId: Long, countVector: Vector) => (docId, countVector) }
  .cache()

and then feed it into LDA:

lda.run(countVectors)

In the Java API, however, I obtain the CountVectorizerModel with the following code:

CountVectorizerModel cvModel = new CountVectorizer()
    .setInputCol("filtered")
    .setOutputCol("features")
    .setVocabSize(vocabSize)
    .fit(filteredTokens);

Its transformed output looks like this:

(0,(22,[0,8,9,10,14,16,18],
[1.0,1.0,1.0,1.0,1.0,1.0,1.0]))
(1,(22,[0,1,2,3,4,5,6,7,11,12,13,15,17,19,20,21],
[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))

How can I convert the output of cvModel into a JavaPairRDD<Long, Vector> countVectors? I tried this:

JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .mapToPair(new PairFunction<Row, Long, Vector>() {
        public Tuple2<Long, Vector> call(Row row) throws Exception {
            return new Tuple2<Long, Vector>(Long.parseLong(row.getString(0)),
                    Vectors.dense(row.getDouble(1)));
        }
    }).cache();

But it does not work. I get an exception at:

Vectors.dense(row.getDouble(1))

So if you know a good way to convert from the DataFrame produced by cvModel to a JavaPairRDD<Long, Vector>, please let me know.

I am using Spark and MLlib 1.5.1 with Java 8.

Any help is greatly appreciated. Thanks. Here is the exception log from my attempt to convert the DataFrame to a JavaPairRDD:

15/10/25 10:03:07 ERROR Executor: Exception in task 0.0 in stage 7.0     (TID 6)
java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String
at org.apache.spark.sql.Row$class.getString(Row.scala:249)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:191)
at UIT_LDA_ONLINE.LDAOnline$2.call(LDAOnline.java:88)
at UIT_LDA_ONLINE.LDAOnline$2.call(LDAOnline.java:1)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030)
at org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1030)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/10/25 10:03:07 WARN TaskSetManager: Lost task 0.0 in stage 7.0 (TID 6, localhost): java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.String

Best Answer

Now that we have the error stack, here is the problem:

You are trying to read a String from the Row, but that field is actually a Long, so for starters you need to replace row.getString(0) with row.getLong(0).
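
To make this concrete, here is a minimal sketch of the corrected mapToPair. The getLong(0) fix is the point above; reading the features column as an MLlib Vector via a cast, instead of rebuilding it with Vectors.dense, is an assumption based on the output schema CountVectorizer produces:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.sql.Row;
import scala.Tuple2;

JavaPairRDD<Long, Vector> countVectors = cvModel.transform(filteredTokens)
    .select("docId", "features").toJavaRDD()
    .mapToPair(new PairFunction<Row, Long, Vector>() {
        public Tuple2<Long, Vector> call(Row row) throws Exception {
            // docId is a long column, so read it with getLong, not getString
            long docId = row.getLong(0);
            // "features" already holds an MLlib Vector (assumption: the
            // CountVectorizer output column), so cast it rather than
            // reconstructing it with Vectors.dense
            Vector features = (Vector) row.get(1);
            return new Tuple2<Long, Vector>(docId, features);
        }
    }).cache();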

Once you have corrected that one, you will hit further errors of the same kind at other levels, which I can't pinpoint from the information given, but you will be able to solve them yourself if you apply the following:

Row getters are specific to each field's type, so you need to call the getter that matches the column's type.
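
For illustration (the column names and positions below are hypothetical), the type-to-getter mapping works like this:

// Each getter matches one Spark SQL type (hypothetical columns for illustration):
long id      = row.getLong(0);    // bigint  -> getLong
String text  = row.getString(1);  // string  -> getString
double score = row.getDouble(2);  // double  -> getDouble
// For UDT columns such as MLlib vectors there is no dedicated getter,
// so use the generic get and cast the result:
Vector features = (Vector) row.get(3);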

If you are not sure which method to use, you can call printSchema on the DataFrame to check each field's type; the full mapping from Spark SQL types to getters is covered in the official documentation.
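
For example, a quick schema check could look like this (the printed output is an assumption about what the CountVectorizer pipeline produces in Spark 1.5):

// Print the schema so you can pick the right Row getter for each column
cvModel.transform(filteredTokens).printSchema();
// Assumed output for this pipeline:
// root
//  |-- docId: long (nullable = false)
//  |-- features: vector (nullable = true)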

Regarding "java - Converting from DataFrame to JavaPairRDD<Long, Vector>", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/33308586/
