When Python runs CountVectorizer, it calls the tokenizer more than once, many times in fact, so the code below executes like this:
split on whitespace
stem tokens
filter tokens
split on whitespace
stem tokens
filter tokens
split on whitespace
...
This makes the code far too slow. I can't figure out why the vectorizer behaves this way, or how to prevent it and speed the code up.
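For context, the call pattern can be reproduced with a minimal counting wrapper; the toy documents below are made up for illustration and are not the real corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

calls = 0

def counting_tokenizer(text):
    # count how many times CountVectorizer invokes the tokenizer
    global calls
    calls += 1
    return text.split()

docs = ["first toy document", "second toy document", "third toy document"]
cv = CountVectorizer(tokenizer=counting_tokenizer)
cv.fit_transform(docs)
print(calls)  # one call per document: prints 3
```

In other words, fit_transform tokenizes each document separately, so a corpus of N posts yields N tokenizer calls; that by itself is expected behavior, not a bug.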
# import the stop-word dictionary
import re
from sklearn.feature_extraction.text import CountVectorizer

stop_words = []
with open("Stopwords.txt") as f:
    for word in f:
        stop_words.append(word.replace("\n", ""))

# import the stem dictionary
d = {}
with open("Stem rečnik.txt") as f:
    for line in f:
        key, val = line.split(":")
        d[re.compile(key)] = val.replace("\n", "")

def custom_tokenizer(text):
    # split on whitespace
    tokens = text.split()
    # stem tokens: replace a token with its stem on the first matching pattern
    for i, token in enumerate(tokens):
        for key, val in d.items():
            if re.match(key, token):
                tokens[i] = val
    # filter out tokens shorter than 3 characters
    tokens = [token for token in tokens if len(token) >= 3]
    return tokens

cv = CountVectorizer(tokenizer=custom_tokenizer, analyzer='word', encoding='utf-8',
                     min_df=0, max_df=1.0, stop_words=frozenset(stop_words))
post_text_trainCV = cv.fit_transform(post_text_train)
I checked, and the problem is not the tokenizer function itself; it runs correctly. But when I run the last line, it calls the tokenizer many times.
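If the slow part is the stemming loop rather than the number of calls, one common fix is to merge all the stem patterns into a single alternation regex, so each token is matched once instead of once per pattern. A sketch under that assumption; the two sample rules here are hypothetical stand-ins for what would be loaded from "Stem rečnik.txt":

```python
import re

# hypothetical stem rules (pattern -> stem); in the question they come
# from "Stem rečnik.txt". Assumes the patterns contain no groups of their own.
stem_rules = {"radi": "rad", "pevaj": "pev"}

# merge every pattern into one alternation; each pattern gets a named
# group so the match can be traced back to its stem
merged = re.compile("|".join(f"(?P<g{i}>{pat})" for i, pat in enumerate(stem_rules)))
stems = list(stem_rules.values())

def stem(token):
    # one regex match per token instead of a loop over all patterns
    m = merged.match(token)
    if m:
        return stems[int(m.lastgroup[1:])]
    return token

print(stem("radionica"))  # -> rad
print(stem("nothing"))    # -> nothing (no rule matches)
```

With a large dictionary this turns the per-token cost from O(number of patterns) regex calls into a single call against one compiled pattern.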