
I am using spaCy's built-in model, en_core_web_lg, and want to train it with my custom entities. While doing so, I am facing two issues:

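  1. The new training overwrites what the model had already learned, so it no longer recognizes the other entities. For example, before training it can recognize PERSON and ORG, but after training it recognizes neither PERSON nor ORG.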

  2. During training, it gives me the following warning:

UserWarning: [W030] Some entities could not be aligned in the text "('I work in Google.',)" with entities "[(9, 15, 'ORG')]". Use `spacy.training.offsets_to_biluo_tags(nlp.make_doc(text), entities)` to check the alignment. Misaligned entities ('-') will be ignored during training.

Here is my entire code:

import spacy
import random
from spacy.util import minibatch, compounding
from pathlib import Path
from spacy.training.example import Example
sentence = ""
body1 = "James work in Facebook and love to have tuna fishes in the breafast."
nlp_lg = spacy.load("en_core_web_lg")
print(nlp_lg.pipe_names)
doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


train = [
    ('I had tuna fish in breakfast', {'entities': [(6,14,'FOOD')]}),
    ('I love prawns the most', {'entities': [(6,12,'FOOD')]}),
    ('fish is the rich source of protein', {'entities': [(0,4,'FOOD')]}),
    ('I work in Google.', {'entities': [(9,15,'ORG')]})
    ]


ner = nlp_lg.get_pipe("ner")

for _, annotations in train:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

disable_pipes = [pipe for pipe in nlp_lg.pipe_names if pipe != 'ner']

with nlp_lg.disable_pipes(*disable_pipes):
    optimizer = nlp_lg.resume_training()
    for interation in range(30):
        random.shuffle(train)
        losses = {}

        batches = minibatch(train, size=compounding(1.0,4.0,1.001))
        for batch in batches:
            text, annotation = zip(*batch)
            doc1 = nlp_lg.make_doc(str(text))
            example = Example.from_dict(doc1, annotations)
            nlp_lg.update(
                [example],
                drop = 0.5,
                losses = losses,
                sgd = optimizer
                )
            print("Losses",losses)

doc = nlp_lg(body1)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Expected output:

James 0 5 PERSON
Facebook 14 22 ORG
tuna fishes 40 51 FOOD

Currently it recognizes no entities.

Please let me know where I am going wrong. Thanks!


1 Answer


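The "overwriting" you describe is called "catastrophic forgetting", and there is a post about it on the spaCy blog. There is no perfect way to work around it, but we have a recent fix here.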
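
As a rough sketch of the rehearsal idea behind workarounds of this kind (not necessarily what the linked fix does): keep some examples labelled with the entities the original model already knows, and mix them in with the new FOOD examples so that every pass also sees PERSON and ORG. The helper name `mix_with_revision`, the `revision` list and the `ratio` parameter are hypothetical. The revision annotations below are written by hand for illustration; in practice they would more likely come from running the unmodified en_core_web_lg over ordinary text.

```python
import random

def mix_with_revision(new_examples, revision_examples, ratio=2, seed=0):
    """Return the new examples plus up to `ratio` revision examples per new one, shuffled."""
    rng = random.Random(seed)
    wanted = min(len(revision_examples), ratio * len(new_examples))
    mixed = list(new_examples) + rng.sample(revision_examples, wanted)
    rng.shuffle(mixed)
    return mixed

# New entity type to teach the model
new_examples = [
    ("fish is the rich source of protein", {"entities": [(0, 4, "FOOD")]}),
]

# Hand-written stand-ins for predictions of the original model (assumption)
revision = [
    ("James works at Facebook.", {"entities": [(0, 5, "PERSON"), (15, 23, "ORG")]}),
    ("I work in Google.", {"entities": [(10, 16, "ORG")]}),
    ("Anna moved to Berlin.", {"entities": [(0, 4, "PERSON"), (14, 20, "GPE")]}),
]

train = mix_with_revision(new_examples, revision, ratio=2)
for text, annotation in train:
    print(text, annotation["entities"])
```

The mixed list can then be fed to the same update loop in place of the FOOD-only data. How many revision examples are needed is something to find by experiment.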

Regarding your alignment error...

"('I work in Google.',)" with entities "[(9, 15, 'ORG')]"

Your character offsets are off.

"I work in Google."[9:15]
# => " Googl"
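
The same slicing check can be run over every training example, as a minimal sketch. The helper names `show_spans` and `find_offsets` are made up for this example, and `find_offsets` assumes the phrase occurs exactly once in the text.

```python
train = [
    ("I had tuna fish in breakfast", {"entities": [(6, 14, "FOOD")]}),
    ("I love prawns the most", {"entities": [(6, 12, "FOOD")]}),
    ("fish is the rich source of protein", {"entities": [(0, 4, "FOOD")]}),
    ("I work in Google.", {"entities": [(9, 15, "ORG")]}),
]

def show_spans(data):
    """Return (text_slice, label) for every annotated span."""
    return [
        (text[start:end], label)
        for text, annotation in data
        for start, end, label in annotation["entities"]
    ]

def find_offsets(text, phrase, label):
    """Locate `phrase` in `text` and return a (start, end, label) triple."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in {text!r}")
    return (start, start + len(phrase), label)

# What each annotation actually selects
for piece, label in show_spans(train):
    print(repr(piece), label)

# Offsets that match the intended string
print(find_offsets("I work in Google.", "Google", "ORG"))  # (10, 16, 'ORG')
```

On the four examples from the question the slices come out as 'tuna fis', ' prawn', 'fish' and ' Googl'.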

Maybe they are off by a constant value and you can fix this by adding one to everything, but you will need to look at your data to figure that out.
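
Separately, and as a guess from the warning text alone: the quoted text reads like the `repr` of a one-element tuple rather than the sentence itself. That is what `str()` gives for the tuple returned by `zip(*batch)`, and the extra `('` at the front shifts every character position by two. A plain-Python sketch with no spaCy involved:

```python
batch = [("I work in Google.", {"entities": [(9, 15, "ORG")]})]

texts, annotations = zip(*batch)  # texts is a tuple of strings, not a string
as_string = str(texts)            # repr of the tuple, brackets and quotes included
print(as_string)                  # ('I work in Google.',)

# Unpacking each pair keeps the text as a plain str,
# together with the annotation that belongs to it
for text, annotation in batch:
    print(repr(text), annotation["entities"])
```

In the training loop this would presumably mean building one `Example` per `(text, annotation)` pair instead of one per batch.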

Answered 2021-06-21T05:52:23.507