When Python runs CountVectorizer, it calls the tokenizer more than once, many times in fact, so the code below executes like this:
split on whitespace
stem tokens
filter tokens
split on whitespace
stem tokens
filter tokens
split on whitespace
...
This makes the code far too slow. I can't figure out why the vectorizer behaves this way, or how to prevent it and speed the code up.
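For context, the call pattern can be reproduced with a minimal counting wrapper; the toy documents below are made up for illustration and are not the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

calls = 0

def counting_tokenizer(text):
    # count how many times CountVectorizer invokes the tokenizer
    global calls
    calls += 1
    return text.split()

docs = ["first toy document", "second toy document", "third toy document"]
cv = CountVectorizer(tokenizer=counting_tokenizer)
cv.fit_transform(docs)
print(calls)  # one call per document: prints 3
```

In other words, fit_transform tokenizes each document separately, so a corpus of N posts yields N tokenizer calls; that by itself is expected behavior, not a bug.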
# import the stop-word dictionary
import re
from sklearn.feature_extraction.text import CountVectorizer

stop_words = []
with open("Stopwords.txt") as f:
    for word in f:
        stop_words.append(word.replace("\n", ""))

# import the stem dictionary
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[re.compile(key)] = val.replace("\n", "")

def custom_tokenizer(text):
    # split on whitespace
    tokens = text.split()
    # stem tokens: replace a token with its stem on the first matching pattern
    for i, token in enumerate(tokens):
        for key, val in d.items():
            if re.match(key, token):
                tokens[i] = val
    # filter out tokens shorter than 3 characters
    tokens = [token for token in tokens if len(token) >= 3]
    return tokens

cv = CountVectorizer(tokenizer=custom_tokenizer, analyzer='word', encoding='utf-8',
                     min_df=0, max_df=1.0, stop_words=frozenset(stop_words))
post_text_trainCV = cv.fit_transform(post_text_train)
I checked, and the problem is not the tokenizer function itself; it runs correctly. But when I run the last line, it calls the tokenizer many times.
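If the slow part is the stemming loop rather than the number of calls, one common fix is to merge all the stem patterns into a single alternation regex, so each token is matched once instead of once per pattern. A sketch under that assumption; the two sample rules here are hypothetical stand-ins for what would be loaded from "Stem rečnik.txt":

```python
import re

# hypothetical stem rules (pattern -> stem); in the question they come
# from "Stem rečnik.txt". Assumes the patterns contain no groups of their own.
stem_rules = {"radi": "rad", "pevaj": "pev"}

# merge every pattern into one alternation; each pattern gets a named
# group so the match can be traced back to its stem
merged = re.compile("|".join(f"(?P<g{i}>{pat})" for i, pat in enumerate(stem_rules)))
stems = list(stem_rules.values())

def stem(token):
    # one regex match per token instead of a loop over all patterns
    m = merged.match(token)
    if m:
        return stems[int(m.lastgroup[1:])]
    return token

print(stem("radionica"))  # -> rad
print(stem("nothing"))    # -> nothing (no rule matches)
```

With a large dictionary this turns the per-token cost from O(number of patterns) regex calls into a single call against one compiled pattern.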