
python - Issues with Spark MLLib that cause probability and prediction to be the same for everything

Reposted. Author: 可可西里. Updated: 2023-11-01 14:51:35

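I'm learning how to use machine learning with Spark MLLib, with the goal of doing sentiment analysis on tweets. I got a sentiment analysis dataset from here: http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip

The dataset contains 1 million tweets classified as positive or negative. The second column of the dataset contains the sentiment, and the fourth column contains the tweet.

This is my current PySpark code: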

import csv
from pyspark.sql import Row
from pyspark.sql.functions import rand
from pyspark.ml.feature import Tokenizer
from pyspark.ml.feature import StopWordsRemover
from pyspark.ml.feature import Word2Vec
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.classification import LogisticRegression

data = sc.textFile("/home/omar/sentiment-train.csv")
header = data.first()
rdd = data.filter(lambda row: row != header)

r = rdd.mapPartitions(lambda x : csv.reader(x))
r2 = r.map(lambda x: (x[3], int(x[1])))

parts = r2.map(lambda x: Row(sentence=x[0], label=int(x[1])))
partsDF = spark.createDataFrame(parts)
partsDF = partsDF.orderBy(rand()).limit(10000)

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
tokenized = tokenizer.transform(partsDF)

remover = StopWordsRemover(inputCol="words", outputCol="base_words")
base_words = remover.transform(tokenized)

train_data_raw = base_words.select("base_words", "label")

word2Vec = Word2Vec(vectorSize=100, minCount=0, inputCol="base_words", outputCol="features")

model = word2Vec.fit(train_data_raw)
final_train_data = model.transform(train_data_raw)
final_train_data = final_train_data.select("label", "features")

lr = LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
lrModel = lr.fit(final_train_data)

lrModel.transform(final_train_data).show()

I'm running this in the PySpark interactive shell, started with the following command:

pyspark --master yarn --deploy-mode client --conf='spark.executorEnv.PYTHONHASHSEED=223'

(FYI: I have an HDFS cluster of 10 virtual machines with YARN, Spark, etc.)

The result of the last line of code is:

>>> lrModel.transform(final_train_data).show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 1|[0.00885206627292...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02994908031541...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.03443818541709...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02838905728422...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00561632859171...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02029798456545...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.02020387646293...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01861085715063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00212163510598...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01254413221031...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.01443821341672...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02591390228879...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.00590923184063...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02487089103516...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00999667861365...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00416736607439...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.00715923445144...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02524911996890...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 1|[0.01635813603934...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
| 0|[0.02773649083489...|[-0.0332030500349...|[0.4917,0.5083000...| 1.0|
+-----+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

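If I do the same thing with a smaller dataset that I created by hand, it works. I don't know what's going on, and I've been struggling with this all day.

Any suggestions?

Thanks for your time!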

Best answer

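TL;DR Ten iterations is far too low for any real-life application. On large, non-trivial datasets it can take thousands of iterations or more (as well as tuning the remaining parameters) to converge.

The binomial LogisticRegressionModel has a summary attribute, which gives you access to a LogisticRegressionSummary object. Among other useful metrics, it contains objectiveHistory, which can be used to debug the training process: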

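import matplotlib.pyplot as plt
from pyspark.ml.classification import LogisticRegression

# df is the training DataFrame with "label" and "features" columns
lrm = LogisticRegression(..., family="binomial").fit(df)

# objectiveHistory holds the value of the objective at each training iteration
plt.plot(lrm.summary.objectiveHistory)
plt.show()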
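
To make the idea of an objective history concrete without a cluster, here is a minimal pure-Python sketch. It is not Spark's optimizer: it runs plain batch gradient descent on the mean log-loss of a one-feature logistic regression and records the loss at every iteration, which is roughly the kind of curve objectiveHistory exposes. The function train_logreg, the toy data, and the step size are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(xs, ys, max_iter, step=1.0):
    """Batch gradient descent on the mean log-loss for one feature plus a bias.

    Returns (w, b, history); history[i] is the loss measured just before update i.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(max_iter):
        loss = grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            grad_w += (p - y) * x
            grad_b += p - y
        history.append(loss / n)
        w -= step * grad_w / n
        b -= step * grad_b / n
    return w, b, history


# Hypothetical toy data: small-magnitude features and balanced binary labels.
xs = [-0.03, -0.02, -0.01, 0.01, 0.02, 0.03]
ys = [0, 0, 0, 1, 1, 1]

_, _, short_history = train_logreg(xs, ys, max_iter=10)
_, _, long_history = train_logreg(xs, ys, max_iter=5000)

print("last recorded loss, max_iter=10:  ", short_history[-1])
print("last recorded loss, max_iter=5000:", long_history[-1])
```

In this toy run the loss barely moves within ten iterations and keeps falling when training runs longer. A similarly flat objectiveHistory curve in Spark would be a hint to revisit maxIter and the remaining parameters.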

Regarding python - Issues with Spark MLLib that cause probability and prediction to be the same for everything, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/44480077/
