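I am trying to read 100 training files and vectorize them with sklearn. Each file contains words that represent system calls. Once the files are vectorized, I want to print the vectors. My first attempt was: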
import re
import os
import sys
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import numpy as np
import numpy.linalg as LA

trainingdataDir = 'C:\data\Training data'

def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
            data = open(trainingfiles, "rb").read()
    return data

train_set = [readfile()]
vectorizer = CountVectorizer()
transformer = TfidfTransformer()
trainVectorizerArray = vectorizer.fit_transform(train_set).toarray()
print 'Fit Vectorizer to train set', trainVectorizerArray
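However, this only returns the vector for the last file. I concluded that the print call should go inside the for loop, so my second attempt was: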
def readfile():
    for file in os.listdir(trainingdataDir):
        trainingfiles = os.path.join(trainingdataDir, file)
        if os.path.isfile(trainingfiles):
            data = open(trainingfiles, "rb").read()
            trainVectorizerArray = vectorizer.fit_transform(data).toarray()
            print 'Fit Vectorizer to train set', trainVectorizerArray
However, this prints nothing at all. Can you help me fix this? Why don't I see the vectors printed?
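
For reference, the result being aimed for (one row of counts per file) could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a confirmed fix: it collects the text of every file into a list, fits a single `CountVectorizer` on the whole list, and prints the resulting matrix. The helper names `read_files` and `vectorize`, the temporary directory, and the toy file names and contents are all hypothetical stand-ins for the real training data, and the code is written for Python 3 rather than the Python 2 used above.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer


def read_files(directory):
    """Return the text of every regular file in `directory`, one string per file."""
    documents = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "r") as handle:
                documents.append(handle.read())
    return documents


def vectorize(documents):
    """Fit one CountVectorizer on all documents; return (vocabulary, dense counts)."""
    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(documents).toarray()
    # vocabulary_ maps each term to its column index; order the terms by column.
    vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
    return vocabulary, matrix


# Toy stand-in for the training directory (made-up file names and contents).
demo_dir = tempfile.mkdtemp()
samples = {
    "a.txt": "open read read close",
    "b.txt": "open write close",
    "c.txt": "fork exec exit",
}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w") as handle:
        handle.write(text)

train_set = read_files(demo_dir)           # list with one string per file
vocabulary, train_array = vectorize(train_set)
print(vocabulary)                          # column labels
print(train_array)                         # one row per file
```

With the real data, `read_files(trainingdataDir)` would take the place of the toy directory. The point of the sketch is that `fit_transform` is called once and receives a list containing one string per file, rather than a single string or only the last file read.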