
java - Deeplearning4j Spark pipeline: Convert a String type to org.apache.spark.mllib.linalg.VectorUDT

Reposted. Author: 行者123. Updated: 2023-11-29 04:50:46

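I have a sentiment analysis program that uses a recurrent neural network to predict whether a given movie review is positive or negative. The program is built on the Deeplearning4j deep learning library. Now I need to add it to an Apache Spark pipeline.

To do this, I have a class MovieReviewClassifier that extends org.apache.spark.ml.classification.ProbabilisticClassifier, and I have to add an instance of that class to the pipeline. The features needed to build the model are passed in with the setFeaturesCol(String s) method. My features are Strings, because they are the review texts used for sentiment analysis, but the features column is required to be of type org.apache.spark.mllib.linalg.VectorUDT. Is there a way to convert a String to a VectorUDT?

My pipeline implementation is attached below: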

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.IndexToString;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.StringIndexerModel;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class RNNPipeline {
    final static String RESPONSE_VARIABLE = "s";
    final static String INDEXED_RESPONSE_VARIABLE = "indexedClass";
    final static String FEATURES = "features";
    final static String PREDICTION = "prediction";
    final static String PREDICTION_LABEL = "predictionLabel";

    public static void main(String[] args) {

        SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("test-client").setMaster("local[2]");
        sparkConf.set("spark.driver.allowMultipleContexts", "true");
        JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
        SQLContext sqlContext = new SQLContext(javaSparkContext);

        // ======================== Import data ====================================
        DataFrame dataFrame = sqlContext.read().format("com.databricks.spark.csv")
                .option("inferSchema", "true")
                .option("header", "true")
                .load("/home/RNN3/WordVec/training.csv");

        // Split into train/test data
        double[] dataSplitWeights = {0.7, 0.3};
        DataFrame[] data = dataFrame.randomSplit(dataSplitWeights);

        // ======================== Preprocess ===========================

        // Encode labels
        StringIndexerModel labelIndexer = new StringIndexer().setInputCol(RESPONSE_VARIABLE)
                .setOutputCol(INDEXED_RESPONSE_VARIABLE)
                .fit(data[0]);

        // Convert indexed labels back to original labels (decode labels).
        IndexToString labelConverter = new IndexToString().setInputCol(PREDICTION)
                .setOutputCol(PREDICTION_LABEL)
                .setLabels(labelIndexer.labels());

        // ======================== Train ========================

        // "Review" is the raw String column that is passed in as the features column.
        MovieReviewClassifier mrClassifier = new MovieReviewClassifier()
                .setLabelCol(INDEXED_RESPONSE_VARIABLE)
                .setFeaturesCol("Review");

        // Fit the pipeline for training.
        Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] { labelIndexer, mrClassifier, labelConverter });
        PipelineModel pipelineModel = pipeline.fit(data[0]);

    }

}

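Review is the feature column; it contains the strings to be classified as positive or negative.

Running the code produces the following error: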

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column Review must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually StringType.
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:50)
at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:167)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:167)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:62)
at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:121)
at RNNPipeline.main(RNNPipeline.java:82)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Best answer

According to the VectorUDT documentation:

User-defined type for Vector which allows easy interaction with SQL via DataFrame.

And the ML library guide says:

DataFrame supports many basic and structured types; see the Spark SQL datatype reference for a list of supported types. In addition to the types listed in the Spark SQL guide, DataFrame can use ML Vector types.

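In effect, you are being asked to supply an org.apache.spark.sql.types.UserDefinedType<Vector>, that is, a column whose values are Vectors rather than Strings.

You can probably get away with passing a DenseVector or a SparseVector created from your String.

How to convert the String (the "Review" column?) to a Vector depends on how your data is organized.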
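To make the two options concrete: a dense vector stores every entry, while a sparse vector stores only its size plus the indices and values of the non-zero entries. The sketch below uses plain Java so that it runs without Spark; `SparseLayout` and `fromDense` are hypothetical names made up for illustration, not Spark or Deeplearning4j classes. With Spark 1.x on the classpath, the resulting arrays could be handed to `Vectors.dense(double[])` or `Vectors.sparse(int, int[], double[])` from `org.apache.spark.mllib.linalg`.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not a Spark or Deeplearning4j class.
final class SparseLayout {
    final int size;        // logical length of the vector
    final int[] indices;   // positions of the non-zero entries, ascending
    final double[] values; // the non-zero entries, aligned with indices

    SparseLayout(int size, int[] indices, double[] values) {
        this.size = size;
        this.indices = indices;
        this.values = values;
    }

    // Keep only the non-zero entries of a dense array.
    static SparseLayout fromDense(double[] dense) {
        int nonZero = 0;
        for (double v : dense) {
            if (v != 0.0) {
                nonZero++;
            }
        }
        int[] idx = new int[nonZero];
        double[] vals = new double[nonZero];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                idx[k] = i;
                vals[k] = dense[i];
                k++;
            }
        }
        return new SparseLayout(dense.length, idx, vals);
    }

    public static void main(String[] args) {
        double[] dense = {0.0, 2.0, 0.0, 0.0, 1.5};
        SparseLayout sparse = fromDense(dense);
        System.out.println("size=" + sparse.size
                + " indices=" + Arrays.toString(sparse.indices)
                + " values=" + Arrays.toString(sparse.values));
        // With Spark available (assumption), either form becomes a Vector:
        //   Vectors.dense(dense)
        //   Vectors.sparse(sparse.size, sparse.indices, sparse.values)
    }
}
```

The sparse form tends to pay off for text features, where most entries are zero; for short, fully populated vectors the dense form is simpler.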
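As a rough illustration of that dependence, here are two possible conversions written in plain Java so they run without Spark. Both are assumptions about what the Review column might hold, not the asker's actual data layout, and `ReviewFeatures`, `parseNumeric` and `hashTokens` are hypothetical names. The first handles a column that already contains comma-separated numbers; the second turns raw text into term counts with the hashing trick. Either resulting `double[]` could then be wrapped with `Vectors.dense(double[])` before it reaches the classifier.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not a Spark or Deeplearning4j class.
final class ReviewFeatures {
    private ReviewFeatures() {}

    // Case 1 (assumption): the column already holds numbers, e.g. "0.1,0.5,0.0".
    // Throws NumberFormatException if any piece is not a number.
    static double[] parseNumeric(String csv) {
        String[] parts = csv.trim().split("\\s*,\\s*");
        double[] out = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Double.parseDouble(parts[i]);
        }
        return out;
    }

    // Case 2 (assumption): the column holds raw text. Each lower-cased token is
    // hashed into one of numFeatures buckets and the bucket count is incremented.
    static double[] hashTokens(String text, int numFeatures) {
        double[] out = new double[numFeatures];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            int bucket = Math.floorMod(token.hashCode(), numFeatures);
            out[bucket] += 1.0;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] numeric = parseNumeric("0.1, 0.5, 0.0");
        double[] hashed = hashTokens("Great movie, great cast!", 16);
        System.out.println(numeric.length + " numeric features, "
                + hashed.length + " hashed features");
    }
}
```

Inside a pipeline, Spark ML's own `Tokenizer` and `HashingTF` transformers (in `org.apache.spark.ml.feature`) do a comparable job as stages placed before the classifier, although `HashingTF` uses its own hash function, so the bucket indices will differ from this sketch. Whether term counts are a suitable input for a recurrent network trained on word vectors is a separate question that depends on the model.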

Regarding "java - Deeplearning4j Spark pipeline: Convert a String type to org.apache.spark.mllib.linalg.VectorUDT", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/35502161/
