I am using the code below to construct a document-term matrix in Python.
# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
dataset=pd.read_csv("trainset.csv",encoding = "ISO-8859-1")
# Strip punctuation and digits, then lowercase
# (regex=True is needed on pandas >= 2.0, where str.replace no longer
# treats the pattern as a regex by default)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'[^\w\s]', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\d', ' ', regex=True)
dataset['ProductDescription'] = dataset['ProductDescription'].str.lower()
# Remove English stop words (requires nltk.download('stopwords'))
stop = set(stopwords.words('english'))
dataset['ProductDescription'] = dataset['ProductDescription'].str.replace(r'\b(' + '|'.join(stop) + r')\b\s*', ' ', regex=True)
# Build the document-term matrix and pivot it to terms x documents
vectorizer = CountVectorizer()
x1 = vectorizer.fit_transform(dataset['ProductDescription'].values.astype('U'))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names_out())
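As an aside, I believe most of the manual cleanup above could be handed to CountVectorizer itself (lowercase, stop_words, and token_pattern are documented parameters; the specific pattern below is just my guess at an equivalent, and scikit-learn's built-in English stop list differs slightly from NLTK's):

# Sketch: pushing the lowercasing, digit/punctuation stripping and
# stop-word removal into CountVectorizer instead of pandas str.replace
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(
    lowercase=True,              # default; replaces .str.lower()
    stop_words='english',        # replaces the hand-built stop-word regex
    token_pattern=r'[a-zA-Z]+',  # letters only, so digits and punctuation drop out
)
x = vec.fit_transform(["Red COTTON shirt, size 42!"])
print(vec.get_feature_names_out())  # ['cotton' 'red' 'shirt' 'size']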
The code works fine on a dataset of 10,000 rows, but with a larger dataset of about 1,100,000 rows I get a MemoryError when this line executes:
df = pd.DataFrame(x1.toarray().transpose(), index=vectorizer.get_feature_names_out())
Can someone tell me where I am going wrong?
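My suspicion is that x1.toarray() is the problem: it materializes the full dense matrix, so with 1,100,000 documents and, say, a 50,000-term vocabulary (a guess; mine may well be larger), that is 1,100,000 × 50,000 × 8 bytes ≈ 440 GB of int64 counts, almost all of them zeros. A minimal sketch of the sparse alternative I am considering, assuming pandas >= 0.25 (for DataFrame.sparse.from_spmatrix) and scikit-learn >= 1.0 (for get_feature_names_out), with a stand-in corpus:

# Sketch: keep the document-term matrix sparse all the way into pandas
# instead of densifying it with toarray()
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["red cotton shirt", "blue denim jeans", "red cotton jeans"]  # stand-in data
vectorizer = CountVectorizer()
x1 = vectorizer.fit_transform(corpus)  # scipy.sparse matrix, mostly zeros

# from_spmatrix keeps the counts in sparse columns, so memory scales with
# the number of non-zero entries rather than rows * columns
df = pd.DataFrame.sparse.from_spmatrix(
    x1.transpose(),
    index=vectorizer.get_feature_names_out(),
)
print(df)

Would that be the right way to build the matrix at this scale, or is there something else wrong with my approach?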