java - Converting JavaRDD<Row> to JavaRDD<Vector>

Reposted by: 行者123. Last updated: 2023-11-30 08:03:10

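I am trying to run LDA on a Wikipedia XML dump. After obtaining an RDD of raw text, I create a DataFrame and transform it through a pipeline of Tokenizer, StopWords, and CountVectorizer stages. I intend to pass the RDD of Vectors output by CountVectorizer to OnlineLDA in MLlib. Here is my code: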

    // Configure an ML pipeline
    RegexTokenizer tokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("words");

    StopWordsRemover remover = new StopWordsRemover()
        .setInputCol("words")
        .setOutputCol("filtered");

    CountVectorizer cv = new CountVectorizer()
        .setVocabSize(vocabSize)
        .setInputCol("filtered")
        .setOutputCol("features");

    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] {tokenizer, remover, cv});

    // Fit the pipeline to train documents.
    PipelineModel model = pipeline.fit(fileDF);

    JavaRDD<Vector> countVectors = model.transform(fileDF)
        .select("features").toJavaRDD()
        .map(new Function<Row, Vector>() {
            public Vector call(Row row) throws Exception {
                Object[] arr = row.getList(0).toArray();

                double[] features = new double[arr.length];
                int i = 0;
                for (Object obj : arr) {
                    features[i++] = (double) obj;
                }
                return Vectors.dense(features);
            }
        });

I get a ClassCastException caused by this line:

Object[] arr = row.getList(0).toArray();


Caused by: java.lang.ClassCastException: org.apache.spark.mllib.linalg.SparseVector cannot be cast to scala.collection.Seq
at org.apache.spark.sql.Row$class.getSeq(Row.scala:278)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getSeq(rows.scala:192)
at org.apache.spark.sql.Row$class.getList(Row.scala:286)
at org.apache.spark.sql.catalyst.expressions.GenericRow.getList(rows.scala:192)
at xmlProcess.ParseXML$2.call(ParseXML.java:142)
at xmlProcess.ParseXML$2.call(ParseXML.java:1)

I found the Scala syntax for doing this here, but I could not find any example of doing it in Java. I tried row.getAs[Vector](0), but that is Scala-only syntax. Is there a way to do this in Java?

Best answer

It turns out I was able to do this with a simple cast to Vector. I don't know why I didn't try the simple thing first!

    JavaRDD<Vector> countVectors = model.transform(fileDF)
        .select("features").toJavaRDD()
        .map(new Function<Row, Vector>() {
            public Vector call(Row row) throws Exception {
                return (Vector) row.get(0);
            }
        });
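
The stack trace suggests why the cast works where getList does not: the "features" cell already holds a vector object (a SparseVector), and getList tries to treat that object as a sequence. The sketch below reproduces the same failure mode with plain Java stand-ins. DemoRow, DemoVector, and RowCastDemo are hypothetical classes written for illustration; they are not Spark's Row or Vector, and their methods only imitate the behavior described above. In Spark itself, the Java spelling of the Scala call row.getAs[Vector](0) should be row.<Vector>getAs(0), which is an alternative to the explicit cast.

```java
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT Spark's classes.
final class DemoVector {
    final double[] values;
    DemoVector(double... values) { this.values = values; }
}

final class DemoRow {
    private final Object[] cells;
    DemoRow(Object... cells) { this.cells = cells; }

    // Untyped access, in the spirit of Row.get(int): returns the stored object as-is.
    Object get(int i) { return cells[i]; }

    // Typed access, in the spirit of Row.getAs: an unchecked cast to the caller's type.
    @SuppressWarnings("unchecked")
    <T> T getAs(int i) { return (T) cells[i]; }

    // List access, in the spirit of Row.getList(int): only valid if the cell holds a list.
    List<?> getList(int i) { return (List<?>) cells[i]; }
}

class RowCastDemo {
    // Returns true when getList(0) throws because the cell is not a list.
    static boolean getListFails(DemoRow row) {
        try {
            row.getList(0);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        DemoRow row = new DemoRow(new DemoVector(1.0, 0.0, 2.0));

        // The cell holds a vector object, so reading it as a list fails.
        System.out.println(getListFails(row)); // prints true

        // Both the explicit cast and the generic accessor return the stored object.
        DemoVector viaCast = (DemoVector) row.get(0);
        DemoVector viaGetAs = row.<DemoVector>getAs(0);
        System.out.println(viaCast == viaGetAs); // prints true
    }
}
```

Either form hands back the object that is already stored in the row, so no element-by-element copy into a dense array is needed, and a sparse vector stays sparse.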

Regarding java - converting JavaRDD<Row> to JavaRDD<Vector>, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/36410804/
