You can implement this as a custom preprocessor rather than extending the stop word list of CountVectorizer. Below is a simple version, shown in a bpython session.
>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # Lowercase the text, then replace each number (a digit followed by any digits or dots) with NUM.
>>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))
>>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1),
preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
>>> cv.vocabulary_
{u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}
Precompiling the regular expression would probably give some speedup over a large number of samples.
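A minimal sketch of the precompiled variant, assuming a module-level pattern is acceptable in your code. The pattern `\d[\d.]*` and the names `NUMBER_RE` and `replace_numbers` are choices made for this sketch, not part of scikit-learn. Since the `re` module already caches recently used patterns, the gain may be modest, so it is worth timing on your own data.

```python
import re

# Compiled once, so the pattern is not looked up again for every document.
# Matches a digit followed by any run of digits or dots, e.g. "12" or "3.14".
NUMBER_RE = re.compile(r'\d[\d.]*')


def replace_numbers(text):
    """Lowercase the text and replace each number with the token 'NUM'."""
    return NUMBER_RE.sub('NUM', text.lower())


print(replace_numbers('12 dogs eat candy'))  # NUM dogs eat candy
print(replace_numbers('1 2 3 45'))           # NUM NUM NUM NUM
```

The function can then be passed in place of the lambda, as in `CountVectorizer(preprocessor=replace_numbers)`. A named function also has the advantage that the fitted vectorizer can be pickled, which a lambda prevents.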