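I am using spaCy's built-in model, en_core_web_lg, and want to train it further on my own custom entities. While doing so, I am running into two problems.
First, the new training overwrites what the model had already learned, so it stops recognizing other entities. For example, before training it recognizes PERSON and ORG, but after training it recognizes neither.
Second, during training it gives me the following warning: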
UserWarning: [W030] Some entities could not be aligned in the text "('I work in Google.',)" with entities "[(9, 15, 'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.
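The text quoted in the warning, "('I work in Google.',)", looks like the string form of a one-element tuple rather than the sentence itself. Below is a minimal sketch in plain Python (no spaCy calls) of how unpacking a batch with `zip(*batch)` and then calling `str()` on the result could produce exactly that text, and why the offsets (9, 15) then no longer line up. The variable names are illustrative only.

```python
# One batch holding a single (text, annotations) pair, shaped like the training data.
batch = [("I work in Google.", {"entities": [(9, 15, "ORG")]})]

# zip(*batch) transposes the batch: `texts` is a tuple of strings,
# not a single string.
texts, annotations = zip(*batch)

# str() on a tuple returns its repr, including parentheses and quotes.
as_string = str(texts)
print(as_string)                      # ('I work in Google.',)

start, end, label = annotations[0]["entities"][0]

# Slicing the stringified tuple with the annotated offsets.
print(repr(as_string[start:end]))     # 'in Goo'

# Slicing the real sentence with the same offsets.
print(repr(texts[0][start:end]))      # ' Googl'
```

In this sketch neither slice yields "Google", which suggests that both the stringified tuple and the hand-written offsets may be contributing to the misalignment.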
Here is my entire code:
import spacy
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example
sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breafast."
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
train = [
    ('I had tuna fish in breakfast', {'entities': [(6,14,'FOOD')]}),
    ('I love prawns the most', {'entities': [(6,12,'FOOD')]}),
    ('fish is the rich source of protein', {'entities': [(0,4,'FOOD')]}),
    ('I work in Google.', {'entities': [(9,15,'ORG')]})
]
ner = nlp_lg.get_pipe("ner")
for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])
disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']
with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for interation in range(30):
        random.shuffle(train)
        losses = {}
        batches = minibatch(train, size=compounding(1.0,4.0,1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1, annotations)
            nlp_lg.update(
                [example],
                drop = 0.5,
                losses = losses,
                sgd = optimizer
            )
        print("Losses",losses)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
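Character offsets like those in the `train` list can be computed instead of typed by hand, which makes them easy to verify. Below is a minimal sketch, assuming each entity phrase occurs exactly once in its sentence. `find_span`, `samples`, and `train_checked` are hypothetical names used for illustration and are not part of spaCy.

```python
def find_span(text, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)


# (sentence, entity phrase, label) triples taken from the training sentences.
samples = [
    ("I had tuna fish in breakfast", "tuna fish", "FOOD"),
    ("I love prawns the most", "prawns", "FOOD"),
    ("fish is the rich source of protein", "fish", "FOOD"),
    ("I work in Google.", "Google", "ORG"),
]

# Build training tuples in the same (text, {"entities": [...]}) shape.
train_checked = [
    (text, {"entities": [find_span(text, phrase, label)]})
    for text, phrase, label in samples
]

# Print each span so the slice can be compared with the intended phrase.
for text, ann in train_checked:
    for start, end, label in ann["entities"]:
        print(text[start:end], start, end, label)
```

For these sentences the computed spans are (6, 15), (7, 13), (0, 4), and (10, 16), which differ from three of the four hand-written offsets in the `train` list.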
Expected output:
James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD
Currently it recognizes no entities at all.
Please let me know where I am going wrong. Thanks!