spaCy/bin/wiki_entity_linking/train_descriptions.py
Sofie Van Landeghem 569cc98982
Update spaCy for thinc 8.0.0 (#4920)
* Add load_from_config function

* Add train_from_config script

* Merge configs and expose via spacy.config

* Fix script

* Suggest create_evaluation_callback

* Hard-code for NER

* Fix errors

* Register command

* Add TODO

* Update train-from-config todos

* Fix imports

* Allow delayed setting of parser model nr_class

* Get train-from-config working

* Tidy up and fix scores and printing

* Hide traceback if cancelled

* Fix weighted score formatting

* Fix score formatting

* Make output_path optional

* Add Tok2Vec component

* Tidy up and add tok2vec_tensors

* Add option to copy docs in nlp.update

* Copy docs in nlp.update

* Adjust nlp.update() for set_annotations

* Don't shuffle pipes in nlp.update, decruft

* Support set_annotations arg in component update

* Support set_annotations in parser update

* Add get_gradients method

* Add get_gradients to parser

* Update errors.py

* Fix problems caused by merge

* Add _link_components method in nlp

* Add concept of 'listeners' and ControlledModel

* Support optional attributes arg in ControlledModel

* Try having tok2vec component in pipeline

* Fix tok2vec component

* Fix config

* Fix tok2vec

* Update for Example

* Update for Example

* Update config

* Add eg2doc util

* Update and add schemas/types

* Update schemas

* Fix nlp.update

* Fix tagger

* Remove hacks from train-from-config

* Remove hard-coded config str

* Calculate loss in tok2vec component

* Tidy up and use function signatures instead of models

* Support union types for registry models

* Minor cleaning in Language.update

* Make ControlledModel specifically Tok2VecListener

* Fix train_from_config

* Fix tok2vec

* Tidy up

* Add function for bilstm tok2vec

* Fix type

* Fix syntax

* Fix pytorch optimizer

* Add example configs

* Update for thinc describe changes

* Update for Thinc changes

* Update for dropout/sgd changes

* Update for dropout/sgd changes

* Unhack gradient update

* Work on refactoring _ml

* Remove _ml.py module

* WIP upgrade cli scripts for thinc

* Move some _ml stuff to util

* Import link_vectors from util

* Update train_from_config

* Import from util

* Import from util

* Temporarily add ml.component_models module

* Move ml methods

* Move typedefs

* Update load vectors

* Update gitignore

* Move imports

* Add PrecomputableAffine

* Fix imports

* Fix imports

* Fix imports

* Fix missing imports

* Update CLI scripts

* Update spacy.language

* Add stubs for building the models

* Update model definition

* Update create_default_optimizer

* Fix import

* Fix comment

* Update imports in tests

* Update imports in spacy.cli

* Fix import

* fix obsolete thinc imports

* update srsly pin

* from thinc to ml_datasets for example data such as imdb

* update ml_datasets pin

* using STATE.vectors

* small fix

* fix Sentencizer.pipe

* black formatting

* rename Affine to Linear as in thinc

* set validate explicitly to True

* rename with_square_sequences to with_list2padded

* rename with_flatten to with_list2array

* chaining layernorm

* small fixes

* revert Optimizer import

* build_nel_encoder with new thinc style

* fixes using model's get and set methods

* Tok2Vec in component models, various fixes

* fix up legacy tok2vec code

* add model initialize calls

* add in build_tagger_model

* small fixes

* setting model dims

* fixes for ParserModel

* various small fixes

* initialize thinc Models

* fixes

* consistent naming of window_size

* fixes, removing set_dropout

* work around Iterable issue

* remove legacy tok2vec

* util fix

* fix forward function of tok2vec listener

* more fixes

* trying to fix PrecomputableAffine (not successful yet)

* alloc instead of allocate

* add morphologizer

* rename residual

* rename fixes

* Fix predict function

* Update parser and parser model

* fixing few more tests

* Fix precomputable affine

* Update component model

* Update parser model

* Move backprop padding to own function, for test

* Update test

* Fix p. affine

* Update NEL

* build_bow_text_classifier and extract_ngrams

* Fix parser init

* Fix test add label

* add build_simple_cnn_text_classifier

* Fix parser init

* Set gpu off by default in example

* Fix tok2vec listener

* Fix parser model

* Small fixes

* small fix for PyTorchLSTM parameters

* revert my_compounding hack (iterable fixed now)

* fix biLSTM

* Fix uniqued

* PyTorchRNNWrapper fix

* small fixes

* use helper function to calculate cosine loss

* small fixes for build_simple_cnn_text_classifier

* putting dropout default at 0.0 to ensure the layer gets built

* using thinc util's set_dropout_rate

* moving layer normalization inside of maxout definition to optimize dropout

* temp debugging in NEL

* fixed NEL model by using init defaults!

* fixing after set_dropout_rate refactor

* proper fix

* fix test_update_doc after refactoring optimizers in thinc

* Add CharacterEmbed layer

* Construct tagger Model

* Add missing import

* Remove unused stuff

* Work on textcat

* fix test (again :)) after optimizer refactor

* fixes to allow reading Tagger from_disk without overwriting dimensions

* don't build the tok2vec prematurely

* fix CharacterEmbed init

* CharacterEmbed fixes

* Fix CharacterEmbed architecture

* fix imports

* renames from latest thinc update

* one more rename

* add initialize calls where appropriate

* fix parser initialization

* Update Thinc version

* Fix errors, auto-format and tidy up imports

* Fix validation

* fix if bias is cupy array

* revert for now

* ensure it's a numpy array before running bp in ParserStepModel

* no reason to call require_gpu twice

* use CupyOps.to_numpy instead of cupy directly

* fix initialize of ParserModel

* remove unnecessary import

* fixes for CosineDistance

* fix device renaming

* use refactored loss functions (Thinc PR 251)

* overfitting test for tagger

* experimental settings for the tagger: avoid zero-init and subword normalization

* clean up tagger overfitting test

* use previous default value for nP

* remove toy config

* bringing layernorm back (had a bug - fixed in thinc)

* revert setting nP explicitly

* remove setting default in constructor

* restore values as they used to be

* add overfitting test for NER

* add overfitting test for dep parser

* add overfitting test for textcat

* fixing init for linear (previously affine)

* larger eps window for textcat

* ensure doc is not None

* Require newer thinc

* Make float check vaguer

* Slop the textcat overfit test more

* Fix textcat test

* Fix exclusive classes for textcat

* fix after renaming of alloc methods

* fixing renames and mandatory arguments (staticvectors WIP)

* upgrade to thinc==8.0.0.dev3

* refer to vocab.vectors directly instead of its name

* rename alpha to learn_rate

* adding hashembed and staticvectors dropout

* upgrade to thinc 8.0.0.dev4

* add name back to avoid warning W020

* thinc dev4

* update srsly

* using thinc 8.0.0a0 !

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2020-01-29 17:06:46 +01:00


# coding: utf-8
from random import shuffle

import logging
import numpy as np

from thinc.model import Model
from thinc.api import chain, set_dropout_rate
from thinc.loss import CosineDistance
from thinc.layers import Linear

from spacy.util import create_default_optimizer

logger = logging.getLogger(__name__)


class EntityEncoder:
    """
    Train the embeddings of entity descriptions to fit a fixed-size entity vector (e.g. 64D).
    This entity vector will be stored in the KB, for further downstream use in the entity model.
    """

    DROP = 0
    BATCH_SIZE = 1000

    # Set min. acceptable loss to avoid a 'mean of empty slice' warning by numpy
    MIN_LOSS = 0.01

    # Reasonable default to stop training when things are not improving
    MAX_NO_IMPROVEMENT = 20

    def __init__(self, nlp, input_dim, desc_width, epochs=5):
        self.nlp = nlp
        self.input_dim = input_dim
        self.desc_width = desc_width
        self.epochs = epochs
        self.distance = CosineDistance(ignore_zeros=True, normalize=False)
        # The encoder is only built by _build_network() at the start of training
        self.encoder = None

    def apply_encoder(self, description_list):
        if self.encoder is None:
            raise ValueError("Cannot apply encoder before training it")

        batch_size = 100000

        start = 0
        stop = min(batch_size, len(description_list))
        encodings = []

        while start < len(description_list):
            docs = list(self.nlp.pipe(description_list[start:stop]))
            doc_embeddings = [self._get_doc_embedding(doc) for doc in docs]
            # Thinc 8: use predict() for inference-only calls
            enc = self.encoder.predict(np.asarray(doc_embeddings))
            encodings.extend(enc.tolist())

            logger.info("Encoded: {} entities".format(len(encodings)))
            start = start + batch_size
            stop = min(stop + batch_size, len(description_list))

        return encodings

    def train(self, description_list, to_print=False):
        processed, loss = self._train_model(description_list)
        if to_print:
            logger.info(
                "Trained entity descriptions on {} (non-unique) descriptions "
                "across {} epochs".format(processed, self.epochs)
            )
            logger.info("Final loss: {}".format(loss))

    def _train_model(self, description_list):
        best_loss = 1.0
        iter_since_best = 0
        self._build_network(self.input_dim, self.desc_width)

        processed = 0
        loss = 1
        # copy this list so that shuffling does not affect other functions
        descriptions = description_list.copy()
        to_continue = True

        for i in range(self.epochs):
            shuffle(descriptions)

            batch_nr = 0
            start = 0
            stop = min(self.BATCH_SIZE, len(descriptions))

            while to_continue and start < len(descriptions):
                batch = []
                for descr in descriptions[start:stop]:
                    doc = self.nlp(descr)
                    doc_vector = self._get_doc_embedding(doc)
                    batch.append(doc_vector)

                loss = self._update(batch)
                if batch_nr % 25 == 0:
                    logger.info("loss: {}".format(loss))
                processed += len(batch)

                # in general, continue training if we haven't reached our ideal min yet
                to_continue = loss > self.MIN_LOSS

                # store the best loss and track how long it's been
                if loss < best_loss:
                    best_loss = loss
                    iter_since_best = 0
                else:
                    iter_since_best += 1

                # stop learning if we haven't seen improvement since the last few iterations
                if iter_since_best > self.MAX_NO_IMPROVEMENT:
                    to_continue = False

                batch_nr += 1
                start = start + self.BATCH_SIZE
                stop = min(stop + self.BATCH_SIZE, len(descriptions))

        return processed, loss

    @staticmethod
    def _get_doc_embedding(doc):
        indices = np.zeros((len(doc),), dtype="i")
        for i, word in enumerate(doc):
            if word.orth in doc.vocab.vectors.key2row:
                indices[i] = doc.vocab.vectors.key2row[word.orth]
            else:
                indices[i] = 0
        word_vectors = doc.vocab.vectors.data[indices]
        doc_vector = np.mean(word_vectors, axis=0)
        return doc_vector

    def _build_network(self, orig_width, hidden_width):
        with Model.define_operators({">>": chain}):
            # very simple encoder-decoder model
            self.encoder = Linear(hidden_width, orig_width)
            # TODO: removed the zero_init here - is that OK?
            self.model = self.encoder >> Linear(orig_width, hidden_width)
        # Thinc 8: allocate parameters before the first update
        self.model.initialize()
        self.sgd = create_default_optimizer()
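
    # Design note: the network above is a simple autoencoder. Each batch of
    # averaged word vectors serves both as input and as reconstruction target,
    # and the cosine distance between prediction and input drives the update
    # in _update() below. Only the encoder half is kept for producing the
    # fixed-size entity vectors that go into the KB.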

    def _update(self, vectors):
        truths = self.model.ops.asarray(vectors)
        # Thinc 8: dropout is set on the model; begin_update() no longer takes a drop argument
        set_dropout_rate(self.model, self.DROP)
        predictions, bp_model = self.model.begin_update(truths)
        d_scores, loss = self.distance(predictions, truths)
        # Thinc 8: the backprop callback only takes the gradient; the optimizer
        # is applied separately through finish_update()
        bp_model(d_scores)
        self.model.finish_update(self.sgd)
        return loss / len(vectors)
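

# A minimal usage sketch (hypothetical names; assumes an `nlp` pipeline with
# pretrained vectors and `descriptions`, a list of entity description strings):
#
#     encoder = EntityEncoder(nlp, input_dim=nlp.vocab.vectors.shape[1], desc_width=64)
#     encoder.train(description_list=descriptions, to_print=True)
#     entity_vectors = encoder.apply_encoder(descriptions)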