I am looking into HuggingFace's transfer learning functionality (specifically named entity recognition). As a preface, I am a bit new to the Transformer architecture. Here is the short example I went through from their website:
from transformers import pipeline
nlp = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge which is visible from the window."
print(nlp(sequence))
What I would like to do is save and run this locally without having to download the "ner" model every time (it is over 1 GB in size). In their documentation I saw that you can save the pipeline to a local folder using the "pipeline.save_pretrained()" function. The result is a set of files that get stored in a specific folder.
My question is: after saving, how do I load this model back into a script so I can continue classifying as in the example above? The output of "pipeline.save_pretrained()" is multiple files.
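For reference, this is roughly how I produced those files (a minimal sketch; the folder name "./my_ner_model" is just an example path I made up):

from transformers import pipeline

# Download the default "ner" pipeline once, then save it to a local folder.
# "./my_ner_model" is an illustrative path, not a required name.
nlp = pipeline("ner")
nlp.save_pretrained("./my_ner_model")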
Here is what I have tried so far:
1: Following the documentation on pipelines:
pipe = transformers.TokenClassificationPipeline(model="pytorch_model.bin", tokenizer='tokenizer_config.json')
The error I get is: 'str' object has no attribute 'config'
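My guess is that the pipeline expects actual model and tokenizer objects rather than file-name strings, so perhaps something along these lines is what the docs intend (just a sketch on my part, assuming "./my_ner_model" is the folder produced by save_pretrained()), but I am not sure this is the right way:

from transformers import AutoModelForTokenClassification, AutoTokenizer, TokenClassificationPipeline

# Load the model and tokenizer objects from the saved folder ("./my_ner_model"
# is only an example path), then hand them to the pipeline class directly.
model = AutoModelForTokenClassification.from_pretrained("./my_ner_model")
tokenizer = AutoTokenizer.from_pretrained("./my_ner_model")
pipe = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
print(pipe(sequence))  # reusing the sequence from the first example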
2: Following the HuggingFace example on NER:
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
model = AutoModelForTokenClassification.from_pretrained("path to folder following .save_pretrained()")
tokenizer = AutoTokenizer.from_pretrained("path to folder following .save_pretrained()")
label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC",   # Location
]
sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
"close to the Manhattan Bridge."
# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])
This produces the error: list index out of range
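I also wonder whether my hard-coded label_list even matches the labels the saved model was trained with; a variant I considered (a sketch relying on model.config.id2label, which I believe the loaded config provides) is:

# Look each label up on the model's own config instead of a hand-written list;
# id2label maps a prediction index to its label string.
predictions = torch.argmax(outputs, dim=2)
print([(token, model.config.id2label[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])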
I have also tried simply printing out the text and its predicted entities, without returning the tokens.
Any help would be greatly appreciated!