python - How can I solve memory problems when extracting n-grams with CountVectorizer?

Translated from: https://stackoverflow.com/questions/43625253 2017-04-26T04:46:30.173

Viewed 442 times

I have a corpus of 300 MB. I am running 32-bit Windows with 32-bit Python 3.6. How much memory does this operation need? My code is below.

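from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer

a = load_files(r'D:\Train')  # 'Train' has two subfolders; raw string keeps the backslash literal
vectorizer = CountVectorizer(ngram_range=(4, 4), binary=True)  # word 4-grams, presence/absence only
X = vectorizer.fit_transform(a.data)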

Error:

File "D:/spyder/april.py", line 32, in <module>
X = vectorizer.fit_transform(a.data)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
self.fixed_vocabulary_)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 762, in _count_vocab
for feature in analyze(doc):

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 241, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)

File "C:\Users\banu\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 141, in _word_ngrams
tokens.append(" ".join(original_tokens[i: i + n]))

MemoryError
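For context, the last frame of the traceback builds one new string per n-gram position. The snippet below is a simplified re-creation of that step, not scikit-learn's actual code; the function name `word_ngrams` and the sample sentence are made up for illustration. It shows that a document with T tokens yields T - 3 four-gram strings, and that in ordinary text almost all of them are distinct, so the vocabulary grows roughly with the size of the corpus.

```python
def word_ngrams(tokens, n=4):
    """Simplified illustration of joining each window of n tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be that is the question".split()
grams = word_ngrams(tokens)

# Number of tokens, number of 4-grams, number of distinct 4-grams
print(len(tokens), len(grams), len(set(grams)))
```

Each of these strings repeats four words of the original text, so the n-gram strings alone can take several times the space of the corpus itself, which is a lot for a 32-bit process that normally has only about 2 GB of address space.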

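I searched Google for a solution. The suggestion I found was to use a hashing vectorizer (HashingVectorizer), but someone mentioned that it does not give the corresponding tokens and feature names, whereas CountVectorizer provides both the features and their indices. Please give me a solution that uses CountVectorizer itself.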
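One possible way to stay with CountVectorizer is a two-pass approach, sketched below under some assumptions: a first pass counts the document frequency of each 4-gram one document at a time and keeps only the most frequent ones, and a second pass hands that list to CountVectorizer through its `vocabulary` parameter so the feature names and indices are still available. The helper names `top_ngrams` and `vectorize_with_fixed_vocab`, the `keep` cutoff, and the two sample documents are hypothetical, not part of the question.

```python
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer


def top_ngrams(docs, n=4, keep=100_000):
    """First pass: count in how many documents each word n-gram occurs.

    Documents are consumed one at a time, so `docs` can be a generator.
    """
    analyzer = CountVectorizer(ngram_range=(n, n)).build_analyzer()
    counts = Counter()
    for doc in docs:
        counts.update(set(analyzer(doc)))  # set(): count each n-gram once per document
    return [gram for gram, _ in counts.most_common(keep)]


def vectorize_with_fixed_vocab(docs, vocab, n=4):
    """Second pass: build the binary matrix using only the chosen n-grams."""
    vectorizer = CountVectorizer(ngram_range=(n, n), binary=True, vocabulary=vocab)
    X = vectorizer.transform(docs)  # no fit needed when the vocabulary is fixed
    return vectorizer, X


# Hypothetical toy corpus standing in for the real files
docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox sleeps under the lazy dog",
]
vocab = top_ngrams(docs, keep=3)
vectorizer, X = vectorize_with_fixed_vocab(docs, vocab)

# Column j of X corresponds to vocab[j]
print(vocab[0], X.shape)
```

This is not guaranteed to fit in a 32-bit process: the Counter in the first pass still holds every distinct 4-gram it has seen, so on a 300 MB corpus it may need to be pruned periodically (at the cost of approximate counts) or run on a 64-bit Python. To avoid keeping all file contents in memory, `load_files(..., load_content=False)` returns only the file names, and `CountVectorizer(input='filename')` reads each file when it is analyzed.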
