My task is set up as follows:
import gensim
from sklearn.feature_extraction.text import CountVectorizer
newsgroup_data = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
vect = CountVectorizer(stop_words='english',
token_pattern='(?u)\\b\\w\\w\\w+\\b')
X = vect.fit_transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
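For anyone puzzled by the id_map line: it just inverts vect.vocabulary_ (which maps word → column index) into the index → word mapping that gensim's id2word parameter expects. A toy illustration with a made-up three-word vocabulary:

```python
# Hypothetical miniature vocabulary_, in the same word -> index shape
# that CountVectorizer produces after fitting:
vocabulary = {"computer": 0, "human": 1, "interface": 2}

# Invert it to index -> word, as the id_map line above does:
id_map = {v: k for k, v in vocabulary.items()}

print(id_map)  # {0: 'computer', 1: 'human', 2: 'interface'}
```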
My task is to estimate the parameters of an LDA model on this corpus, producing a list of 10 topics and the 10 most significant words in each topic. I did that like this:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus,
        id2word=id_map, num_topics=10, minimum_probability=0)
top10 = ldamodel.print_topics(num_topics=10, num_words=10)
which passed the autograder fine. The next task is to find the topic distribution of a new document, which I attempted as follows:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
newX = vect.transform(new_doc)
newC = gensim.matutils.Sparse2Corpus(newX, documents_columns=False)
print(ldamodel.get_document_topics(newC))
However, this just returns
gensim.interfaces.TransformedCorpus
I also saw the following statement in the documentation: "You can then infer topic distributions on new, unseen documents with >>> doc_lda = lda[doc_bow]", but I had no luck with that here either. Any help is appreciated.