gpt4 book ai didi

apache-spark - Spark 2.1.1 : How to predict topics in unseen documents on already trained LDA model in Spark 2. 1.1?

转载 作者:行者123 更新时间:2023-11-30 09:04:37 25 4
gpt4 key购买 nike

我正在 pyspark (spark 2.1.1) 中根据客户评论数据集训练 LDA 模型。现在,基于该模型,我想预测新的看不见的文本中的主题。

我使用以下代码来制作模型

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext, Row
from pyspark.ml.feature import CountVectorizer
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, CountVectorizer, StopWordsRemover
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.ml.clustering import DistributedLDAModel, LocalLDAModel
from pyspark.mllib.linalg import Vector, Vectors
from pyspark.sql.functions import *
import pyspark.sql.functions as F


path = "D:/sparkdata/sample_text_LDA.txt"
sc = SparkContext("local[*]", "review")
spark = SparkSession.builder.appName('Basics').getOrCreate()
df = spark.read.csv("D:/sparkdata/customers_data.csv", header=True, inferSchema=True)

data = df.select("Reviews").rdd.map(list).map(lambda x: x[0]).zipWithIndex().map(lambda words: Row(idd= words[1], words = words[0].split(" "))).collect()

docDF = spark.createDataFrame(data)
remover = StopWordsRemover(inputCol="words",
outputCol="stopWordsRemoved")
stopWordsRemoved_df = remover.transform(docDF).cache()
Vector = CountVectorizer(inputCol="stopWordsRemoved", outputCol="vectors")
model = Vector.fit(stopWordsRemoved_df)
result = model.transform(stopWordsRemoved_df)
corpus = result.select("idd", "vectors").rdd.map(lambda x: [x[0],Vectors.fromML(x[1])]).cache()

# Cluster the documents topics using LDA
ldaModel = LDA.train(corpus, k=3,maxIterations=100,optimizer='online')
topics = ldaModel.topicsMatrix()
vocabArray = model.vocabulary
print(ldaModel.describeTopics())
wordNumbers = 10 # number of words per topic
topicIndices = sc.parallelize(ldaModel.describeTopics(maxTermsPerTopic = wordNumbers))
def topic_render(topic): # specify vector id of words to actual words
terms = topic[0]
result = []
for i in range(wordNumbers):
term = vocabArray[terms[i]]
result.append(term)
return result

topics_final = topicIndices.map(lambda topic: topic_render(topic)).collect()

for topic in range(len(topics_final)):
print("Topic" + str(topic) + ":")
for term in topics_final[topic]:
print (term)
print ('\n')

现在我有一个数据框,其中有一列包含新的客户评论,我想预测它们属于哪个主题集群。我搜索过答案,大多推荐以下方式,如这里Spark MLlib LDA, how to infer the topics distribution of a new unseen document?

newDocuments: RDD[(Long, Vector)] = ...
topicDistributions = distLDA.toLocal.topicDistributions(newDocuments)

但是,我收到以下错误:

“LDAModel”对象没有属性“toLocal”。它也没有 topicDistribution 属性。

那么spark 2.1.1不支持这些属性吗?

还有其他方法可以从看不见的数据中推断主题吗?

最佳答案

您需要预处理新数据:

# import a new data set to be passed through the pre-trained LDA

data_new = pd.read_csv('YourNew.csv', encoding = "ISO-8859-1");
data_new = data_new.dropna()
data_text_new = data_new[['Your Target Column']]
data_text_new['index'] = data_text_new.index

documents_new = data_text_new
#documents_new = documents.dropna(subset=['Preprocessed Document'])

# process the new data set through the lemmatization, and stopwork functions
processed_docs_new = documents_new['Preprocessed Document'].map(preprocess)

# create a dictionary of individual words and filter the dictionary
dictionary_new = gensim.corpora.Dictionary(processed_docs_new[:])
dictionary_new.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

# define the bow_corpus
bow_corpus_new = [dictionary_new.doc2bow(doc) for doc in processed_docs_new]

然后你可以将它作为函数传递给经过训练的LDA。您所需要的只是 Bow_corpus:

ldamodel[bow_corpus_new[:len(bow_corpus_new)]]

如果您想以 csv 格式输出,请尝试以下操作:

a = ldamodel[bow_corpus_new[:len(bow_corpus_new)]]
b = data_text_new

topic_0=[]
topic_1=[]
topic_2=[]

for i in a:
topic_0.append(i[0][1])
topic_1.append(i[1][1])
topic_2.append(i[2][1])

d = {'Your Target Column': b['Your Target Column'].tolist(),
'topic_0': topic_0,
'topic_1': topic_1,
'topic_2': topic_2}

df = pd.DataFrame(data=d)
df.to_csv("YourAllocated.csv", index=True, mode = 'a')

我希望这有帮助:)

关于apache-spark - Spark 2.1.1 : How to predict topics in unseen documents on already trained LDA model in Spark 2. 1.1?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55808041/

25 4 0
Copyright 2021 - 2024 cfsdn All Rights Reserved 蜀ICP备2022000587号
广告合作:1813099741@qq.com 6ren.com