My task is set up as follows:
import gensim
from sklearn.feature_extraction.text import CountVectorizer
newsgroup_data = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
vect = CountVectorizer(stop_words='english',
token_pattern='(?u)\\b\\w\\w\\w+\\b')
X = vect.fit_transform(newsgroup_data)
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
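For anyone puzzled by the id_map line: it just inverts vect.vocabulary_ (which maps word → column index) into the index → word mapping that gensim's id2word parameter expects. A toy illustration with a made-up three-word vocabulary:

```python
# Hypothetical miniature vocabulary_, in the same word -> index shape
# that CountVectorizer produces after fitting:
vocabulary = {"computer": 0, "human": 1, "interface": 2}

# Invert it to index -> word, as the id_map line above does:
id_map = {v: k for k, v in vocabulary.items()}

print(id_map)  # {0: 'computer', 1: 'human', 2: 'interface'}
```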
My task is to estimate the parameters of an LDA model on this corpus, producing a list of 10 topics and the 10 most significant words in each topic. I did that like this:
ldamodel = gensim.models.ldamodel.LdaModel(corpus=corpus,
        id2word=id_map, num_topics=10, minimum_probability=0)
top10 = ldamodel.print_topics(num_topics=10, num_words=10)
which passed the autograder fine. The next task is to find the topic distribution of a new document, which I attempted as follows:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]
newX = vect.transform(new_doc)
newC = gensim.matutils.Sparse2Corpus(newX, documents_columns=False)
print(ldamodel.get_document_topics(newC))
However, this just returns
gensim.interfaces.TransformedCorpus
I also saw the following statement in the documentation: "You can then infer topic distributions on new, unseen documents with >>> doc_lda = lda[doc_bow]", but I had no luck with that here either. Any help is appreciated.