python - LDA with gensim: how do I update a Postgres database with the correct topic number for each document?

I fetch various documents from a database and use LDA (gensim) to check which latent topics these documents contain. That part works fine. What I would like to do is store the most probable topic for each document back in the database, and I am not sure what the best way to do that is. For example, I could fetch each document's unique id together with text_column at the start and somehow carry it through the processing, so that at the end I know which id belongs to which topic number. Or should I print the documents and their topics at the end? But then I don't know how to connect that back to the database. By comparing text_column with the document and assigning the corresponding topic number? Any comments would be appreciated.
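To make the first option concrete, here is a minimal sketch of the id-based approach. It assumes an open psycopg2 connection conn with cursor cur, a trained gensim model lda with its dictionary (as built further down in this post), and a topic column added to the table beforehand; none of these names beyond the post's own come from the original question.

# sketch of option 1: fetch id together with text_column so the LDA result for each
# row can be written straight back by id (assumed names, see the note above)
cur.execute("SELECT id, text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;")
rows = cur.fetchall()                                   # list of (id, text) tuples

for doc_id, text in rows:
    bow = dictionary.doc2bow(text.lower().split())      # the same preprocessing as below would apply here
    doc_topics = lda[bow]                               # list of (topic_number, probability) pairs
    best_topic = max(doc_topics, key=lambda item: item[1])[0]
    cur.execute("UPDATE table SET topic = %s WHERE id = %s", (best_topic, doc_id))
conn.commit()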

# assumes an open psycopg2 connection (conn) and cursor (cur) created earlier
import string

from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import ldamodel
from itertools import izip

stoplist = stopwords.words('english')

sql = """SELECT text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

documents = []
for i in dbrows:
    documents = documents + list(i)

# lowercase, tokenize and drop stopwords, punctuation and a few extra tokens
additional_list = set("``;''".split(";"))

texts = [[word.lower() for word in document.split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear two times or fewer in the whole corpus
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
my_num_topics = 10

# the LDA model itself
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for the topics
for top in lda.show_topics(my_num_topics):
    print top

# print the most probable topic and the document
for l, t in izip(corpus_lda, documents):
    selected_topic = max(l, key=lambda item: item[1])
    if selected_topic[1] != 1.0 / my_num_topics:  # 1/my_num_topics would be 0 under Python 2 integer division
        selected_topic_number = selected_topic[0]
        print selected_topic
        print t
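As an aside not in the original post: the rare-token filter above calls all_tokens.count() once per distinct word, which gets slow on larger corpora. A single-pass variant using collections.Counter (a swapped-in technique, shown only as a sketch) would be:

# count every token in one pass, then keep only words that occur more than twice
from collections import Counter

token_counts = Counter(token for text in texts for token in text)
texts = [[word for word in text if token_counts[word] > 2] for text in texts]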

Best answer

As wildplasser commented, I simply needed to select the id along with text_column. I had tried that before, but the way I was appending the data to the list was not suitable for the further processing. The code below works and, as a result, creates a table that contains the ids and the most probable topic for each document.

# imports and the open psycopg2 connection (conn, cur) are the same as in the question
import string

from nltk.corpus import stopwords
from gensim import corpora
from gensim.models import ldamodel
from itertools import izip

stoplist = stopwords.words('english')

sql = """SELECT id, text_column FROM table WHERE NULLIF(text_column, '') IS NOT NULL;"""
cur.execute(sql)
dbrows = cur.fetchall()
conn.commit()

# keep the (id, text_column) tuples together so the ids can be mapped back later
documents = []
for i in dbrows:
    documents.append(i)

# lowercase, tokenize and drop stopwords, punctuation and a few extra tokens
additional_list = set("``;''".split(";"))

texts = [[word.lower() for word in document[1].split()
          if word.lower() not in stoplist
          and word not in string.punctuation
          and word.lower() not in additional_list]
         for document in documents]

# remove words that appear two times or fewer in the whole corpus
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) <= 2)
texts = [[word for word in text if word not in tokens_once]
         for text in texts]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
my_num_topics = 10

# the LDA model itself
lda = ldamodel.LdaModel(corpus, id2word=dictionary, num_topics=my_num_topics)
corpus_lda = lda[corpus]

# print the most contributing words for the topics
for top in lda.show_topics(my_num_topics):
    print top

# collect (topic, id) pairs for the most probable topic of each document
lda_topics = []
for l, t in izip(corpus_lda, documents):
    selected_topic = max(l, key=lambda item: item[1])
    if selected_topic[1] != 1.0 / my_num_topics:  # 1/my_num_topics would be 0 under Python 2 integer division
        selected_topic_number = selected_topic[0]
        lda_topics.append((selected_topic_number, int(t[0])))

# write the results back into a new Postgres table
cur.execute("""CREATE TABLE table_topic (id bigint PRIMARY KEY, topic int);""")
for j in lda_topics:
    my_id = j[1]
    topic = j[0]
    cur.execute("INSERT INTO table_topic (id, topic) VALUES (%s, %s)", (my_id, topic))
conn.commit()
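A small follow-up not from the original answer: since psycopg2 cursors implement the standard DB-API executemany(), the per-row inserts at the end could be batched into a single call. A minimal sketch, reusing lda_topics and the table_topic table created above:

# batch the write-back instead of one execute() per row
cur.executemany(
    "INSERT INTO table_topic (id, topic) VALUES (%s, %s)",
    [(doc_id, topic) for topic, doc_id in lda_topics]
)
conn.commit()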

Regarding "python - LDA with gensim: how do I update a Postgres database with the correct topic number for each document?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38098824/
