I took a notebook I had set up on Google Colab and tried to run it on Kaggle, and hit some strange behaviour; it gave me something like this:
16 # text2tensor
---> 17 train_seq,train_mask,train_y = textToTensor(train_text,train_labels,pad_len)
18 val_seq,val_mask,val_y = textToTensor(val_text,val_labels,pad_len)
19
<ipython-input-9-ee85c4607a30> in textToTensor(text, labels, max_len)
4 tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)
5
----> 6 text_seq = torch.tensor(tokens['input_ids'])
7 text_mask = torch.tensor(tokens['attention_mask'])
8
ValueError: expected sequence of length 38 at dim 1 (got 13)
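The ValueError itself just means that `torch.tensor` received a ragged list of `input_ids`, i.e. the sequences were not all padded to the same length. A minimal pure-Python sketch of the shape check torch performs (the token ids here are made up, not real tokenizer output):

```python
def check_rectangular(batch):
    """Mimic the dim-1 shape check torch.tensor performs on nested lists."""
    expected = len(batch[0])
    for row in batch:
        if len(row) != expected:
            raise ValueError(
                f"expected sequence of length {expected} at dim 1 (got {len(row)})"
            )

# a properly padded batch passes; an unpadded one fails exactly like the traceback
padded = [[0] * 38, [0] * 38]
ragged = [[0] * 38, [0] * 13]

check_rectangular(padded)  # fine
try:
    check_rectangular(ragged)
except ValueError as e:
    print(e)  # expected sequence of length 38 at dim 1 (got 13)
```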
The error comes from the following code:
def textToTensor(text, labels=None, max_len=38):  # max_len is 38
    tokens = tokenizer.batch_encode_plus(text.tolist(), max_length=max_len, padding='max_length', truncation=True)

    text_seq = torch.tensor(tokens['input_ids'])  # ERROR CAME FROM HERE
    text_mask = torch.tensor(tokens['attention_mask'])

    text_y = None
    if isinstance(labels, np.ndarray):
        text_y = torch.tensor(labels.tolist())

    return text_seq, text_mask, text_y

train_seq, train_mask, train_y = textToTensor(train_text, train_labels, pad_len)
train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
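One likely explanation, given the version lists below, is the tokenizer API change between transformers 2.x and 3.x: the `padding=` / `truncation=` keyword arguments were introduced in 3.0, while the 2.x series used `pad_to_max_length=True`. If 2.11.0 silently swallows the newer keywords, the sequences come back unpadded on Kaggle and `torch.tensor` fails. A hedged sketch of a version-dispatching helper (`padding_kwargs` is a hypothetical name, not a transformers API):

```python
def padding_kwargs(transformers_version: str, max_len: int) -> dict:
    """Build batch_encode_plus keyword arguments for either API generation.

    Assumption: transformers >= 3.0 understands padding='max_length' and
    truncation=True, while 2.x used pad_to_max_length=True and may silently
    ignore the newer names, leaving the sequences unpadded.
    """
    major = int(transformers_version.split(".")[0])
    if major >= 3:
        return {"max_length": max_len, "padding": "max_length", "truncation": True}
    return {"max_length": max_len, "pad_to_max_length": True}

# Colab (3.5.1) vs Kaggle (2.11.0), the versions reported in the question
print(padding_kwargs("3.5.1", 38))
print(padding_kwargs("2.11.0", 38))
```

The returned dict would be splatted into the call, e.g. `tokenizer.batch_encode_plus(text.tolist(), **padding_kwargs(transformers.__version__, max_len))`.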
I ran this code again on Colab and it worked fine. Could it be a version issue or something similar? Could someone please look into this?
Kaggle configuration:
transformers: 2.11.0
torch: 1.5.1
python: 3.7.6
Colab configuration:
torch: 1.7.0+cu101
transformers: 3.5.1
python: 3.6.9
EDIT: train_text is a numpy array of text and train_labels is a 1-D numeric array with 4 classes in the range 0-3.
Also, I initialized my tokenizer as:
from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
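A quick way to confirm on each environment whether padding was actually applied is to inspect the lengths returned by `batch_encode_plus` before converting to tensors. Sketched here with made-up id lists standing in for `tokens['input_ids']` (`length_report` is a hypothetical helper, not part of transformers):

```python
def length_report(input_ids):
    """Return the sorted set of sequence lengths; more than one entry
    means the batch is ragged and torch.tensor would raise."""
    return sorted({len(seq) for seq in input_ids})

# stand-ins for tokens['input_ids']: padded vs. unpadded tokenizer output
print(length_report([[0] * 38, [0] * 38, [0] * 38]))  # [38]
print(length_report([[0] * 13, [0] * 38]))            # [13, 38]
```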