python - Using CountVectorizer to prepare a dataset for an LDA topic model

Question

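I want to use CountVectorizer from scikit-learn to build the matrix used by an LDA model. But my dataset is a series of coded terms, for example of the following form: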

(1-2252, 5-5588, 10-5478, 2-9632 ....)

How can I tell CountVectorizer to treat each pair, e.g. 1-2252, as a single word?

score 0 · Accepted Answer

Fortunately, I found a helpful blog post that gave me the answer.

I tokenize the text with the following function:

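import re

# Split on commas (and any whitespace after them) so a coded term such as "1-2252" stays one token.
REGEX = re.compile(r",\s*")
def tokenize(text):
    # Trim stray whitespace and lowercase each comma-separated term.
    return [tok.strip().lower() for tok in REGEX.split(text)]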

and pass the tokenizer to CountVectorizer as follows:

from sklearn.feature_extraction.text import CountVectorizer
tf = CountVectorizer(tokenizer=tokenize)
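
For anyone who wants to see the whole thing end to end, here is a minimal sketch of how the custom tokenizer could feed an LDA model. The three sample documents, the variable names, and the choice of n_components=2 are made up for illustration and are not from the blog post. It also assumes each document is a plain comma-separated string without surrounding parentheses; if yours have them, strip them first. Passing token_pattern=None is optional and only there to avoid the warning about an unused default pattern.

```python
import re

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

REGEX = re.compile(r",\s*")


def tokenize(text):
    # Each comma-separated coded term (e.g. "1-2252") becomes one token.
    return [tok.strip().lower() for tok in REGEX.split(text)]


# Hypothetical sample documents, one string of coded terms per document.
docs = [
    "1-2252, 5-5588, 10-5478, 2-9632",
    "1-2252, 2-9632, 7-1111",
    "5-5588, 10-5478, 7-1111, 7-1111",
]

# The custom tokenizer replaces the default word pattern, which would
# otherwise split "1-2252" at the hyphen.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None)
X = vectorizer.fit_transform(docs)  # sparse document-term count matrix

# Fit LDA on the counts; two topics is an arbitrary choice for the sketch.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # one row of topic weights per document

print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Each column of X corresponds to one whole coded term, so the counts that LDA sees are per pair rather than per number.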
