
python - Implementing a topic model with Python (numpy)

Reposted. Author: 太空狗. Updated: 2023-10-30 01:30:02

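Recently, I implemented Gibbs sampling for the LDA topic model in Python using numpy, drawing on some code from a website. In each iteration of Gibbs sampling, we remove one (current) word, sample a new topic for that word according to the posterior conditional probability distribution inferred from the LDA model, and update the word-topic counts, as follows: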

for m, doc in enumerate(docs):  # m: doc id
    for n, t in enumerate(doc):  # n: id of word inside document, t: id of the word globally
        # discount counts for word t with associated topic z
        z = z_m_n[m][n]
        n_m_z[m][z] -= 1
        n_z_t[z, t] -= 1
        n_z[z] -= 1
        n_m[m] -= 1

        # sample new topic for multinomial
        p_z_left = (n_z_t[:, t] + beta) / (n_z + V * beta)
        p_z_right = (n_m_z[m] + alpha) / (n_m[m] + alpha * K)
        p_z = p_z_left * p_z_right
        p_z /= numpy.sum(p_z)
        new_z = numpy.random.multinomial(1, p_z).argmax()

        # set z as the new topic and increment counts
        z_m_n[m][n] = new_z
        n_m_z[m][new_z] += 1
        n_z_t[new_z, t] += 1
        n_z[new_z] += 1
        n_m[m] += 1

In the code above, we sample a new (single) z with numpy's multinomial function.
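For reference, the single draw can also be written without the one-hot detour through `multinomial(...).argmax()`. The following is only a minimal sketch, assuming a NumPy version that provides the `Generator` API; the helper name `sample_index` is made up for illustration and is not part of the original code.

```python
import numpy as np

def sample_index(p, rng):
    """Draw one index from an unnormalized weight vector p (hypothetical helper)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()  # normalize so the weights sum to 1
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
p_z = np.array([0.2, 0.5, 0.3])  # toy topic weights, not taken from a real model
draws = [sample_index(p_z, rng) for _ in range(10000)]
# empirical frequency of each index; should be close to p_z
freq = np.bincount(draws, minlength=3) / len(draws)
```

Both forms draw from the same categorical distribution, so which one to use is mostly a matter of readability.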

Now I want to implement the Joint Sentiment-Topic model from this paper. I need the following structures to keep track of the required counts:

3D matrix containing # occurrences for a word for each topic, for each sentiment
3D matrix containing # occurrences for a topic, for each sentiment, for each document
2D matrix containing # occurrences for a topic, for each sentiment
2D matrix containing # occurrences for a sentiment for each document

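Here is the problem: in this Gibbs sampler, for every word seen in a document, both a new topic and a sentiment label are now sampled from the conditional posterior (Equation 5 on page 4 of the paper). How can I "sample these two values" in Python?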

Thanks in advance...

Best answer

Try this. Sampling from the joint distribution over topics and sentiment labels just means that the entire T x S matrix should sum to 1.
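The idea can be sketched in isolation: flatten the T x S weight matrix, draw one flat index, and map it back to a (topic, sentiment) pair. This is a minimal illustration, assuming NumPy's `Generator` API; the helper name `sample_joint` is hypothetical. With row-major flattening the flat index equals `j * S + k`, which `numpy.unravel_index` inverts.

```python
import numpy as np

def sample_joint(p, rng):
    """Draw (topic, sentiment) from an unnormalized T x S weight matrix."""
    T, S = p.shape
    flat = p.ravel() / p.sum()  # row-major flattening: index = j * S + k
    idx = rng.choice(T * S, p=flat)
    j, k = np.unravel_index(idx, (T, S))
    return int(j), int(k)

rng = np.random.default_rng(0)
# toy weights with T=2, S=3: all mass on topic 0, sentiment 2
p = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]])
j, k = sample_joint(p, rng)
```

Using a non-square example (T differs from S) makes it easy to notice if the row and column of the flat index are decoded the wrong way around.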

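import numpy

docs = [[0, 1], [0, 0], [1, 0, 1]]  # each document is a list of word ids
D = len(docs)
z_d_n = [[0 for _ in range(len(d))] for d in docs]  # topic assignment of each word
l_d_n = [[0 for _ in range(len(d))] for d in docs]  # sentiment assignment of each word

V = 2  # vocabulary size
T = 2  # number of topics
S = 2  # number of sentiments
n_m_j_k = numpy.zeros((V, T, S))  # count of word m with topic j and sentiment k
n_j_k_d = numpy.zeros((T, S, D))  # count of topic j and sentiment k in document d
n_j_k = numpy.zeros((T, S))       # count of topic j and sentiment k overall
n_k_d = numpy.zeros((S, D))       # count of sentiment k in document d
n_d = numpy.zeros(D)              # number of words in document d

beta = .1
alpha = .1
gamma = .1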

# initialize the counts from the starting assignments
for d, doc in enumerate(docs):  # d: doc id
    for n, m in enumerate(doc):  # n: index of the word inside document, m: id of the word in the vocabulary
        # j is the topic
        j = z_d_n[d][n]
        # k is the sentiment
        k = l_d_n[d][n]
        n_m_j_k[m][j][k] += 1
        n_j_k_d[j][k][d] += 1
        n_j_k[j][k] += 1
        n_k_d[k][d] += 1
        n_d[d] += 1

# one Gibbs sweep over all words
for d, doc in enumerate(docs):  # d: doc id
    for n, m in enumerate(doc):  # n: index of the word inside document, m: id of the word in the vocabulary
        # j is the topic
        j = z_d_n[d][n]
        # k is the sentiment
        k = l_d_n[d][n]
        # remove the current word from all counts
        n_m_j_k[m][j][k] -= 1
        n_j_k_d[j][k][d] -= 1
        n_j_k[j][k] -= 1
        n_k_d[k][d] -= 1
        n_d[d] -= 1

        # sample a new topic and sentiment label jointly
        # T is the number of topics
        # S is the number of sentiments
        p_left = (n_m_j_k[m] + beta) / (n_j_k + V * beta)  # T x S array
        p_mid = (n_j_k_d[:, :, d] + alpha) / numpy.tile(n_k_d[:, d] + T * alpha, (T, 1))
        p_right = numpy.tile(n_k_d[:, d] + gamma, (T, 1)) / numpy.tile(n_d[d] + S * gamma, (T, S))
        p = p_left * p_mid * p_right
        p /= numpy.sum(p)
        new_jk = numpy.random.multinomial(1, numpy.reshape(p, (T * S))).argmax()
        # p is flattened row by row, so the flat index is j * S + k
        j = new_jk // S
        k = new_jk % S

        # store the new assignment and add the word back to all counts
        z_d_n[d][n] = j
        l_d_n[d][n] = k
        n_m_j_k[m][j][k] += 1
        n_j_k_d[j][k][d] += 1
        n_j_k[j][k] += 1
        n_k_d[k][d] += 1
        n_d[d] += 1
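When debugging a sampler like this, it can help to check that the incrementally updated count arrays still agree with the assignments after a sweep. The sketch below rebuilds all five arrays from scratch. It is only an illustration under the array shapes used above; the helper name `rebuild_counts` and the toy assignments are assumptions, not part of the original answer.

```python
import numpy as np

def rebuild_counts(docs, z_d_n, l_d_n, V, T, S):
    """Recompute all count arrays from the current assignments (hypothetical helper)."""
    D = len(docs)
    n_m_j_k = np.zeros((V, T, S))
    n_j_k_d = np.zeros((T, S, D))
    for d, doc in enumerate(docs):
        for n, m in enumerate(doc):
            j, k = z_d_n[d][n], l_d_n[d][n]
            n_m_j_k[m, j, k] += 1
            n_j_k_d[j, k, d] += 1
    n_j_k = n_m_j_k.sum(axis=0)  # T x S: sum over words
    n_k_d = n_j_k_d.sum(axis=0)  # S x D: sum over topics
    n_d = n_k_d.sum(axis=0)      # D: sum over sentiments
    return n_m_j_k, n_j_k_d, n_j_k, n_k_d, n_d

docs = [[0, 1], [0, 0], [1, 0, 1]]
z = [[0, 1], [1, 1], [0, 0, 1]]  # toy topic assignments
l = [[1, 0], [0, 0], [1, 1, 0]]  # toy sentiment assignments
counts = rebuild_counts(docs, z, l, V=2, T=2, S=2)
```

Comparing each rebuilt array with its incrementally maintained counterpart, for example with `numpy.array_equal`, would reveal a count that is decremented but never incremented again.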

Regarding python - Implementing a topic model with Python (numpy), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/10519690/
