I am trying to create an LDA model.
Here is what I have done so far: created unigrams and converted the DataFrame to an RDD, based on this post.
The code is as follows:
from pyspark.ml.feature import CountVectorizer

countVectors = CountVectorizer(inputCol="unigrams", outputCol="features", vocabSize=3, minDF=2.0)
model = countVectors.fit(res)
result = model.transform(res)
result.show(5, truncate=False)
Here is the dataset:
+------------------------------------------------------------------------+---+-------------------+
|unigrams |id |features |
+------------------------------------------------------------------------+---+-------------------+
|[born, furyth, leaguenemesi, rise, (the, leaguenemesi, rise, seri, book]|0 |(3,[0,1],[1.0,1.0])|
|[hous, raven, (the, nightfal, chronicl, book] |1 |(3,[0,1],[1.0,1.0])|
|[law, 101everyth, need, know, american, law, fourth, edit] |2 |(3,[],[]) |
|[hot, summer, night] |3 |(3,[],[]) |
|[wet, bundlemega, collect, sex, stori, (30, book, box, set)] |4 |(3,[0],[1.0]) |
+------------------------------------------------------------------------+---+-------------------+
Based on the data above, I created the following RDD that MLlib requires, per the Databricks post I am following:
from pyspark.mllib.linalg import Vector, Vectors
rdd_convert = result.rdd
corpus = rdd_convert.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
corpus.take(4)
The code above produces the following data:
[[0,
Row(unigrams=['born', 'furyth', 'leaguenemesi', 'rise', '(the', 'leaguenemesi', 'rise', 'seri', 'book'], id=0, features=SparseVector(3, {0: 1.0, 1: 1.0}))],
[1,
Row(unigrams=['hous', 'raven', '(the', 'nightfal', 'chronicl', 'book'], id=1, features=SparseVector(3, {0: 1.0, 1: 1.0}))],
[2,
Row(unigrams=['law', '101everyth', 'need', 'know', 'american', 'law', 'fourth', 'edit'], id=2, features=SparseVector(3, {}))],
[3,
Row(unigrams=['hot', 'summer', 'night'], id=3, features=SparseVector(3, {}))]]
Now I want to run LDA on the RDD:
from pyspark.mllib.clustering import LDA, LDAModel
# Cluster the documents into three topics using LDA
from pyspark.mllib.linalg import Vectors
type(corpus)
rdd = spark.sparkContext.parallelize(corpus.collect())
type(rdd)
If I run ldaModel = LDA.train(rdd), I get the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-33-2abff4618359> in <module>()
----> 1 ldaModel = LDA.train(rdd)
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/mllib/clustering.py in train(cls, rdd, k, maxIterations, docConcentration, topicConcentration, seed, checkpointInterval, optimizer)
1037 model = callMLlibFunc("trainLDAModel", rdd, k, maxIterations,
1038 docConcentration, topicConcentration, seed,
-> 1039 checkpointInterval, optimizer)
1040 return LDAModel(model)
1041
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/mllib/common.py in callMLlibFunc(name, *args)
128 sc = SparkContext.getOrCreate()
129 api = getattr(sc._jvm.PythonMLLibAPI(), name)
--> 130 return callJavaFunc(sc, api, *args)
131
132
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/mllib/common.py in callJavaFunc(sc, func, *args)
121 """ Call Java Function """
122 args = [_py2java(sc, a) for a in args]
--> 123 return _java2py(sc, func(*args))
124
125
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
1131 answer = self.gateway_client.send_command(command)
1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
1134
1135 for temp_arg in temp_args:
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/Documents/spark/spark-2.2.1-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(
Py4JJavaError: An error occurred while calling o401.trainLDAModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 1 times, most recent failure: Lost task 0.0 in stage 26.0 (TID 81, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.ml.linalg.SparseVector)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1353)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1352)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1354)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.take(RDD.scala:1327)
at org.apache.spark.mllib.clustering.EMLDAOptimizer.initialize(LDAOptimizer.scala:166)
at org.apache.spark.mllib.clustering.EMLDAOptimizer.initialize(LDAOptimizer.scala:80)
at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:331)
at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainLDAModel(PythonMLLibAPI.scala:552)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for pyspark.ml.linalg.SparseVector)
at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1353)
at org.apache.spark.mllib.api.python.SerDeBase$$anonfun$pythonToJava$1$$anonfun$apply$2.apply(PythonMLLibAPI.scala:1352)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:389)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at scala.collection.AbstractIterator.to(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
at org.apache.spark.rdd.RDD$$anonfun$take$1$$anonfun$29.apply(RDD.scala:1354)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2069)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
I tried to solve it this way but it did not work. Any help resolving this issue would be greatly appreciated.
Best Answer
If you are using Spark 2.2, you should use pyspark.ml.clustering.LDA instead of the mllib one:
from pyspark.ml.clustering import LDA
LDA().fit(result)
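For completeness, here is a minimal sketch of that ml-based route; the parameter values (k=3, maxIter=10) and the describeTopics call are illustrative choices, not part of the original answer:

from pyspark.ml.clustering import LDA

# Assumes `result` is the DataFrame produced by the CountVectorizer step above,
# with the per-document term counts in its "features" column.
lda = LDA(k=3, maxIter=10, featuresCol="features")
ldaModel = lda.fit(result)

# Top terms per topic; term indices refer to the CountVectorizer vocabulary.
ldaModel.describeTopics(maxTermsPerTopic=3).show(truncate=False)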
However, if you want to make the mllib variant work, the correct format is [label, pyspark.mllib.linalg.Vector]:
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.clustering import LDA as MLlibLDA
MLlibLDA.train(
    result.select("id", "features").rdd.mapValues(MLlibVectors.fromML).map(list)
)
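The underlying issue is that CountVectorizer from pyspark.ml emits pyspark.ml.linalg.SparseVector objects, while the old mllib trainer can only unpickle pyspark.mllib.linalg vectors; MLlibVectors.fromML bridges the two. A quick sanity check before training, sketched under that assumption (k=3 is an arbitrary illustrative value):

# Convert each (id, ml-vector) row into [id, mllib-vector] and cache the corpus.
corpus = result.select("id", "features").rdd.mapValues(MLlibVectors.fromML).map(list).cache()
print(corpus.first())  # e.g. [0, SparseVector(3, {0: 1.0, 1: 1.0})] -- now an mllib vector

ldaModel = MLlibLDA.train(corpus, k=3)
print(ldaModel.describeTopics(maxTermsPerTopic=3))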
Regarding python - expected zero arguments for construction of ClassDict (for pyspark.ml.linalg.SparseVector): a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50668577/