I am using the code below to construct a document-term matrix in Python.
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
dataset=pd.read_csv("trainset.csv",encoding = "ISO-8859-1")
# Strip punctuation and digits, then lowercase
# (regex=True is needed on pandas >= 2.0, where str.replace no longer
# treats the pattern as a regex by default)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[^\w\s]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\d', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()
# Remove English stop words (requires nltk.download('stopwords'))
stop = set(stopwords.words('english'))
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + '|'.join(stop) + r')\b\s*', ' ', regex=True)
# Build the document-term matrix and pivot it to terms x documents
vectorizer = CountVectorizer()
x1 = vectorizer.fit_transform(dataset['ProductDescription'].values.astype('U'))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names_out())
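As an aside, I believe most of the manual cleanup above could be handed to CountVectorizer itself (lowercase, stop_words, and token_pattern are documented parameters; the specific pattern below is just my guess at an equivalent, and scikit-learn's built-in English stop list differs slightly from NLTK's):

# Sketch: pushing the lowercasing, digit/punctuation stripping and
# stop-word removal into CountVectorizer instead of pandas str.replace
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    lowercase=True,              # default; replaces .str.lower()
    stop_words='english',        # replaces the hand-built stop-word regex
    token_pattern=r'[a-zA-Z]+',  # letters only, so digits and punctuation drop out
)
x = vec.fit_transform(["Red COTTON shirt, size 42!"])
print(vec.get_feature_names_out())  # ['cotton' 'red' 'shirt' 'size']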
The code works fine on a dataset of 10,000 rows, but with a larger dataset of about 1,100,000 rows I get a MemoryError when this line executes:
df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names_out())
Can someone tell me where I am going wrong?
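My suspicion is that x1.toarray() is the problem: it materializes the full dense matrix, so with 1,100,000 documents and, say, a 50,000-term vocabulary (a guess; mine may well be larger), that is 1,100,000 × 50,000 × 8 bytes ≈ 440 GB of int64 counts, almost all of them zeros. A minimal sketch of the sparse alternative I am considering, assuming pandas >= 0.25 (for DataFrame.sparse.from_spmatrix) and scikit-learn >= 1.0 (for get_feature_names_out), with a stand-in corpus:

# Sketch: keep the document-term matrix sparse all the way into pandas
# instead of densifying it with toarray()
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["red cotton shirt", "blue denim jeans", "red cotton jeans"]  # stand-in data
vectorizer = CountVectorizer()
x1 = vectorizer.fit_transform(corpus)  # scipy.sparse matrix, mostly zeros

# from_spmatrix keeps the counts in sparse columns, so memory scales with
# the number of non-zero entries rather than rows * columns
df = pd.DataFrame.sparse.from_spmatrix(
    x1.transpose(),
    index=vectorizer.get_feature_names_out(),
)
print(df)

Would that be the right way to build the matrix at this scale, or is there something else wrong with my approach?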