huggingface-tokenizers - Is there a way to use a Huggingface pretrained tokenizer with the wordpiece prefix?

Question

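I am working on a sequence tagging task with BERT. To align the word pieces with the labels, I need some marker that identifies them, so that I can get a single embedding per word by summing or averaging.

For example, I want the word New~york to be tokenized as New ##~ ##york. Judging by some old examples on the internet, this is what BertTokenizer used to give you, but apparently it no longer does (according to their docs).

So when I run: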

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer(batch_sentences, return_tensors="pt")
decoded = tokenizer.decode(inputs["input_ids"][0])
print(decoded)

I get:

[CLS] hello, i'm testing this efauenufefu [SEP]

But the encoding clearly shows that the nonsense word at the end was indeed split into pieces:

In [4]: inputs
Out[4]: 
{'input_ids': tensor([[  101, 19082,   117,   178,   112,   182,  5193,  1142,   174,  8057,
         23404, 16205, 11470,  1358,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

I also tried BertTokenizerFast, which, unlike BertTokenizer, lets you specify the wordpiece prefix:

from transformers import BertTokenizerFast

tokenizer2 = BertTokenizerFast("bert-base-cased-vocab.txt", wordpieces_prefix="##")
batch_sentences = ["hello, i'm testing this efauenufefu"]
inputs = tokenizer2(batch_sentences, return_tensors="pt")
decoded = tokenizer2.decode(inputs["input_ids"][0])
print(decoded)

However, decoding gives me exactly the same result:

[CLS] hello, i'm testing this efauenufefu [SEP]

So, is there a way to use a pretrained Huggingface tokenizer with the prefix, or do I have to train a custom tokenizer myself?

score 1 · Accepted Answer

Maybe you are looking for tokenize:

from transformers import BertTokenizerFast
t = BertTokenizerFast.from_pretrained('bert-base-uncased')
t.tokenize("hello, i'm testing this efauenufefu")

Output:

['hello',
 ',',
 'i',
 "'",
 'm',
 'testing',
 'this',
 'e',
 '##fa',
 '##uen',
 '##uf',
 '##ef',
 '##u']
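
For context on why decode() in the question shows no prefixes: decoding joins the word pieces back into plain text, so the ## markers disappear even though the pieces are still present in input_ids. The sketch below imitates that merging step in plain Python. The helper name merge_wordpieces is hypothetical and this is a simplified illustration, not the library's actual decoder (which also tidies spacing around punctuation).

```python
def merge_wordpieces(tokens, prefix="##"):
    """Join WordPiece tokens into text, dropping the continuation prefix.

    Hypothetical helper for illustration only; it mimics the idea of
    joining tokens with spaces and then gluing continuation pieces
    onto the preceding piece.
    """
    return " ".join(tokens).replace(" " + prefix, "").strip()


# Token list taken from the tokenize() output above.
tokens = ['hello', ',', 'i', "'", 'm', 'testing', 'this',
          'e', '##fa', '##uen', '##uf', '##ef', '##u']

text = merge_wordpieces(tokens)
print(text)  # hello , i ' m testing this efauenufefu
```

The pieces are only hidden by the merge; working with the token list (or the ids) rather than the decoded string keeps them visible.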

You can also get a mapping from each token to the word it belongs to, among other things:

o = t("hello, i'm testing this efauenufefu", add_special_tokens=False, return_attention_mask=False, return_token_type_ids=False)
o.words()

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
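
Given such a token-to-word mapping, the per-word sum or average asked about in the question could be computed roughly as follows. This is a minimal sketch under stated assumptions: pool_by_word is a hypothetical helper, the vectors are toy 2-dimensional lists rather than real BERT outputs, and None entries are assumed to mark special tokens such as [CLS] and [SEP]. Recent versions of transformers expose the same mapping under the name word_ids().

```python
from collections import defaultdict


def pool_by_word(token_vectors, word_ids, mode="mean"):
    """Combine per-token vectors into one vector per word.

    token_vectors: list of equal-length lists of floats, one per token.
    word_ids: list of ints (or None for special tokens), one per token.
    mode: "mean" to average the pieces of a word, "sum" to add them.
    """
    if mode not in ("mean", "sum"):
        raise ValueError("mode must be 'mean' or 'sum'")
    if len(token_vectors) != len(word_ids):
        raise ValueError("token_vectors and word_ids must have the same length")

    # Collect the vectors of all pieces that belong to the same word.
    groups = defaultdict(list)
    for vec, wid in zip(token_vectors, word_ids):
        if wid is None:  # assumed to mark special tokens; skip them
            continue
        groups[wid].append(vec)

    pooled = []
    for wid in sorted(groups):
        pieces = groups[wid]
        combined = [sum(dim) for dim in zip(*pieces)]  # element-wise sum
        if mode == "mean":
            combined = [x / len(pieces) for x in combined]
        pooled.append(combined)
    return pooled


# Mapping copied from the output above: the last six pieces form word 7.
word_ids = [0, 1, 2, 3, 4, 5, 6, 7, 7, 7, 7, 7, 7]
# Toy "embeddings": token i gets the vector [i, 1.0].
token_vectors = [[float(i), 1.0] for i in range(len(word_ids))]

word_vectors = pool_by_word(token_vectors, word_ids)
print(len(word_vectors))  # 8
print(word_vectors[7])    # [9.5, 1.0]
```

With real model outputs the same grouping would typically be done on tensors; plain lists are used here only to keep the example free of dependencies.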
