I am using HuggingFace's DistilBertTokenizer.
I want to tokenize my text by simply splitting on whitespace:
["Don't", "you", "love", "", "Transformers?", "We", "sure", "do."]
instead of the default behavior, which looks like this:
["Do", "n't", "you", "love", "", "Transformers", "?", "We", "sure", "do", "."]
I read their documentation about Tokenization in general, as well as the page specifically about the BERT Tokenizer, but could not find an answer to this simple question :(
I assume it should be a parameter when loading the tokenizer, but I cannot find it in the parameter list...
EDIT: Minimal code example to reproduce:
from transformers import DistilBertTokenizer

# Load the tokenizer class matching the DistilBERT checkpoint
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-cased')
tokens = tokenizer.tokenize("Don't you love 🤗 Transformers? We sure do.")
print("Tokens: ", tokens)