include ../../_includes/_mixins
p
| This workflow describes how to train new statistical models for spaCy's
| part-of-speech tagger, named entity recognizer and dependency parser.
| Once the model is trained, you can then
| #[+a("/docs/usage/saving-loading") save and load] it.
+h(2, "train-pos-tagger") Training the part-of-speech tagger
+code.
from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab)
doc = Doc(vocab, words=['I', 'like', 'stuff'])
gold = GoldParse(doc, tags=['N', 'V', 'N'])
tagger.update(doc, gold)
tagger.model.end_training()
p
+button(gh("spaCy", "examples/training/train_tagger.py"), false, "secondary") Full example
+h(2, "train-entity") Training the named entity recognizer
+code.
from spacy.vocab import Vocab
from spacy.pipeline import EntityRecognizer
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab()
entity = EntityRecognizer(vocab, entity_types=['PERSON', 'LOC'])
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
gold = GoldParse(doc, entities=['O', 'O', 'B-PERSON', 'L-PERSON', 'O'])
entity.update(doc, gold)
entity.model.end_training()
p
+button(gh("spaCy", "examples/training/train_ner.py"), false, "secondary") Full example
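p
| The entity tags in the example above follow the BILUO scheme:
| #[code B] begins a multi-token entity, #[code I] is inside one,
| #[code L] is its last token, #[code U] marks a single-token entity and
| #[code O] is outside any entity. The plain-Python sketch below shows how
| such tags could be derived from token spans. The helper name
| #[code spans_to_biluo] and the #[code (start, end, label)] span format
| with an exclusive end are assumptions made for illustration, not part of
| spaCy's API.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans, end exclusive, to BILUO tags."""
    tags = ['O'] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # A single-token entity gets the "unit" tag.
            tags[start] = 'U-' + label
        else:
            tags[start] = 'B-' + label
            for i in range(start + 1, end - 1):
                tags[i] = 'I-' + label
            tags[end - 1] = 'L-' + label
    return tags


words = ['Who', 'is', 'Shaka', 'Khan', '?']
# "Shaka Khan" covers tokens 2 and 3, so the span is (2, 4).
tags = spans_to_biluo(len(words), [(2, 4, 'PERSON')])
print(tags)
```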
+h(2, "extend-entity") Extending the named entity recognizer
p
| All #[+a("/docs/usage/models") spaCy models] support online learning, so
| you can update a pre-trained model with new examples. You can even add
| new classes to an existing model, to recognize a new entity type,
| part-of-speech, or syntactic relation. Updating an existing model is
| particularly useful as a "quick and dirty" solution if you have only a
| few corrections or annotations.
p.o-inline-list
+button(gh("spaCy", "examples/training/train_new_entity_type.py"), true, "secondary") Full example
+button("/docs/usage/training-ner", false, "secondary") Usage Workflow
+h(2, "train-dependency") Training the dependency parser
+code.
from spacy.vocab import Vocab
from spacy.pipeline import DependencyParser
from spacy.tokens import Doc
from spacy.gold import GoldParse
vocab = Vocab()
parser = DependencyParser(vocab, labels=['nsubj', 'compound', 'dobj', 'punct'])
doc = Doc(vocab, words=['Who', 'is', 'Shaka', 'Khan', '?'])
gold = GoldParse(doc, heads=[1, 1, 3, 1, 1], deps=['nsubj', 'ROOT', 'compound', 'dobj', 'punct'])
parser.update(doc, gold)
parser.model.end_training()
p
+button(gh("spaCy", "examples/training/train_parser.py"), false, "secondary") Full example
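p
| In the example above, the list of heads gives, for each token, the index
| of its syntactic head, and the root token is its own head. The sketch
| below spells out the resulting arcs in plain Python. The helper name
| #[code list_arcs] is hypothetical and the code does not use spaCy at all.

```python
def list_arcs(words, heads, deps):
    """Return (child, label, head) triples; a token that heads itself is the root."""
    arcs = []
    for i, (head, dep) in enumerate(zip(heads, deps)):
        head_word = 'ROOT' if head == i else words[head]
        arcs.append((words[i], dep, head_word))
    return arcs


words = ['Who', 'is', 'Shaka', 'Khan', '?']
heads = [1, 1, 3, 1, 1]
deps = ['nsubj', 'ROOT', 'compound', 'dobj', 'punct']
arcs = list_arcs(words, heads, deps)
for child, dep, head in arcs:
    print('%s --%s--> %s' % (child, dep, head))
```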
+h(2, "feature-templates") Customizing the feature extraction
p
| spaCy currently uses linear models for the tagger, parser and entity
| recognizer, with weights learned using the
| #[+a("https://explosion.ai/blog/part-of-speech-pos-tagger-in-python") Averaged Perceptron algorithm].
+aside("Linear Model Feature Scheme")
| For a list of the available feature atoms, see the #[+a("/docs/api/features") Linear Model Feature Scheme].
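p
| As a rough illustration of the averaged perceptron mentioned above, the
| sketch below trains a tiny multi-class model in plain Python. It is not
| the implementation used by spaCy or thinc: the class name
| #[code AveragedPerceptron], the string features and the naive averaging
| step are all assumptions made to keep the example short.

```python
from collections import defaultdict


class AveragedPerceptron(object):
    """Toy multi-class averaged perceptron (hypothetical, for illustration)."""

    def __init__(self, classes):
        self.classes = list(classes)
        self.weights = defaultdict(float)   # (feature, class) -> current weight
        self._totals = defaultdict(float)   # (feature, class) -> sum over all steps
        self._steps = 0

    def predict(self, features):
        scores = dict((c, 0.0) for c in self.classes)
        for feat in features:
            for c in self.classes:
                scores[c] += self.weights[(feat, c)]
        # Ties are broken by class order, so results are deterministic.
        return max(self.classes, key=lambda c: scores[c])

    def update(self, features, truth):
        guess = self.predict(features)
        if guess != truth:
            # Standard perceptron step: reward the truth, penalize the guess.
            for feat in features:
                self.weights[(feat, truth)] += 1.0
                self.weights[(feat, guess)] -= 1.0
        # Naive averaging: accumulate every weight after every example.
        self._steps += 1
        for key, value in self.weights.items():
            self._totals[key] += value
        return guess

    def end_training(self):
        # Replace each weight with its average over all update steps.
        if self._steps:
            for key in self.weights:
                self.weights[key] = self._totals[key] / self._steps


# Made-up training data: feature strings paired with a coarse tag.
train = [
    (['word=I', 'suffix=I'], 'N'),
    (['word=like', 'suffix=ke'], 'V'),
    (['word=stuff', 'suffix=ff'], 'N'),
]
model = AveragedPerceptron(['N', 'V'])
for epoch in range(5):
    for features, tag in train:
        model.update(features, tag)
model.end_training()
predictions = [model.predict(features) for features, _ in train]
print(predictions)
```

p
| Averaging the weights over all update steps damps the effect of late
| updates on the final model. The call to #[code end_training()] in this
| toy class plays the role that the examples on this page give to
| #[code model.end_training()].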
p
| Because the models are linear, it's important for accuracy to build
| conjunction features out of the atomic predictors. Let's say you have
| two atomic predictors asking, "What is the part-of-speech of the
| previous token?", and "What is the part-of-speech of the previous
| previous token?". These predictors will introduce a number of features,
| e.g. #[code Prev-pos=NN], #[code Prev-prev-pos=VBZ], etc. A conjunction
| template introduces features such as #[code Prev-pos=NN&Prev-prev-pos=VBZ].
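p
| To make the idea of a conjunction feature concrete, here is a small
| sketch in plain Python. The atom names #[code P1_pos] and
| #[code P2_pos] and the helper #[code make_features] are hypothetical;
| they only mirror the naming style of spaCy's atoms, and strings are used
| purely for readability.

```python
def make_features(atoms, templates):
    """Build one feature string per template from a dict of atom values."""
    features = []
    for template in templates:
        parts = ['%s=%s' % (name, atoms[name]) for name in template]
        features.append('&'.join(parts))
    return features


# Hypothetical atoms: the tag of the previous token (P1) and the one before it (P2).
atoms = {'P1_pos': 'NN', 'P2_pos': 'VBZ'}
templates = [
    ('P1_pos',),            # atomic feature
    ('P2_pos',),            # atomic feature
    ('P1_pos', 'P2_pos'),   # conjunction of both atoms
]
features = make_features(atoms, templates)
print(features)
```

p
| A linear model scores each feature independently, so only the third
| template lets it learn a weight for this specific combination of tags.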
p
| The feature extraction proceeds in two passes. In the first pass, we
| fill an array with the values of all of the atomic predictors. In the
| second pass, we iterate over the feature templates, and fill a small
| temporary array with the predictors that will be combined into a
| conjunction feature. Finally, we hash this array into a 64-bit integer,
| using the MurmurHash algorithm. You can see this at work in the
| #[+a(gh("thinc", "thinc/linear/features.pyx", "94dbe06fd3c8f24d86ab0f5c7984e52dbfcdc6cb")) #[code thinc.linear.features]] module.
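p
| The two passes described above can be sketched in plain Python as
| follows. This is an illustration rather than the thinc code: the atom
| indices, the helper names and the use of #[code hashlib.blake2b] in place
| of MurmurHash (which is not in the standard library) are all assumptions.

```python
import hashlib
import struct

# Hypothetical atom indices; in spaCy these are enum values such as P1_orth.
P2_ORTH, P1_ORTH, W_ORTH = 0, 1, 2
N_ATOMS = 3


def fill_atoms(word_ids, i):
    """First pass: fill an array with the value of every atomic predictor."""
    atoms = [0] * N_ATOMS
    atoms[P2_ORTH] = word_ids[i - 2] if i >= 2 else 0
    atoms[P1_ORTH] = word_ids[i - 1] if i >= 1 else 0
    atoms[W_ORTH] = word_ids[i]
    return atoms


def hash_template(template_index, values):
    """Hash a template's values to a 64-bit integer (blake2b stands in for MurmurHash)."""
    data = struct.pack('<%dQ' % (len(values) + 1), template_index, *values)
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return struct.unpack('<Q', digest)[0]


def extract_features(atoms, templates):
    """Second pass: gather each template's atoms and hash them to one key."""
    keys = []
    for index, template in enumerate(templates):
        values = [atoms[atom] for atom in template]   # small temporary array
        keys.append(hash_template(index, values))
    return keys


templates = [(P2_ORTH, P1_ORTH), (P1_ORTH,), (W_ORTH,)]
word_ids = [11, 22, 33]   # made-up integer IDs for three tokens
atoms = fill_atoms(word_ids, 2)
keys = extract_features(atoms, templates)
print(atoms, keys)
```

p
| Each key can then index a weight table directly, which is why the
| conjunction is hashed rather than stored as a string. Mixing the template
| index into the hash is a choice made for this sketch, so that equal values
| under different templates yield different keys.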
p
| It's easy to change the feature templates to create novel combinations
| of the existing atomic predictors. There's currently no API for adding
| new atomic predictors, though: you'll have to create a subclass of the
| model and write your own #[code set_featuresC] method.
p
| The feature templates are passed in using the #[code features] keyword
| argument to the constructors of the #[+api("tagger") #[code Tagger]],
| #[+api("dependencyparser") #[code DependencyParser]] and
| #[+api("entityrecognizer") #[code EntityRecognizer]]:
+code.
from spacy.vocab import Vocab
from spacy.tagger import Tagger
from spacy.tagger import P2_orth, P1_orth
from spacy.tagger import P2_cluster, P1_cluster, W_orth, N1_orth, N2_orth
vocab = Vocab(tag_map={'N': {'pos': 'NOUN'}, 'V': {'pos': 'VERB'}})
tagger = Tagger(vocab, features=[(P2_orth, P2_cluster), (P1_orth, P1_cluster),
                                 (P2_orth,), (P1_orth,), (W_orth,),
                                 (N1_orth,), (N2_orth,)])
p
| Custom feature templates can be passed to the #[code DependencyParser]
| and #[code EntityRecognizer] in the same way, using the #[code features]
| keyword argument of the constructor.