spaCy/website/docs/usage/v2.jade

//- 💫 DOCS > USAGE > WHAT'S NEW IN V2.0
include ../../_includes/_mixins
p
| We're very excited to finally introduce spaCy v2.0! On this page, you'll
| find a summary of the new features, information on the backwards
| incompatibilities, including a handy overview of what's been renamed or
| deprecated. To help you make the most of v2.0, we also
| #[strong re-wrote almost all of the usage guides and API docs], and added
| more real-world examples. If you're new to spaCy, or just want to brush
| up on some NLP basics and the details of the library, check out
| the #[+a("/docs/usage/spacy-101") spaCy 101 guide] that explains the most
| important concepts with examples and illustrations.
+h(2, "summary") Summary
+grid.o-no-block
+grid-col("half")
p This release features
| entirely new #[strong deep learning-powered models] for spaCy's tagger,
| parser and entity recognizer. The new models are #[strong 20x smaller]
| than the linear models that have powered spaCy until now: from 300 MB to
| only 15 MB.
p
| We've also made several usability improvements that are
| particularly helpful for #[strong production deployments]. spaCy
| v2 now fully supports the Pickle protocol, making it easy to use
| spaCy with #[+a("https://spark.apache.org/") Apache Spark]. The
| string-to-integer mapping is #[strong no longer stateful], making
| it easy to reconcile annotations made in different processes.
| Models are smaller and use less memory, and the APIs for serialization
| are now much more consistent.
+table-of-contents
+item #[+a("#summary") Summary]
+item #[+a("#features") New features]
+item #[+a("#features-pipelines") Improved processing pipelines]
+item #[+a("#features-text-classification") Text classification]
+item #[+a("#features-hash-ids") Hash values instead of integer IDs]
+item #[+a("#features-serializer") Saving, loading and serialization]
+item #[+a("#features-displacy") displaCy visualizer]
+item #[+a("#features-language") Language data and lazy loading]
+item #[+a("#features-matcher") Revised matcher API]
+item #[+a("#features-models") Neural network models]
+item #[+a("#incompat") Backwards incompatibilities]
+item #[+a("#migrating") Migrating from spaCy v1.x]
+item #[+a("#benchmarks") Benchmarks]
p
| The main usability improvements you'll notice in spaCy v2.0 are around
| #[strong defining, training and loading your own models] and components.
| The new neural network models make it much easier to train a model from
| scratch, or update an existing model with a few examples. In v1.x, the
| statistical models depended on the state of the #[code Vocab]. If you
| taught the model a new word, you would have to save and load a lot of
| data — otherwise the model wouldn't correctly recall the features of your
| new example. That's no longer the case.
p
| Due to some clever use of hashing, the statistical models
| #[strong never change size], even as they learn new vocabulary items.
| The whole pipeline is also now fully differentiable. Even if you don't
| have explicitly annotated data, you can update spaCy using all the
| #[strong latest deep learning tricks] like adversarial training, noise
| contrastive estimation or reinforcement learning.
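p
| To illustrate why hashing can keep a model's size constant, here is a
| minimal sketch of the general "hashing trick" in plain Python. It is not
| spaCy's actual implementation: the #[code HashedTable] class, the
| #[code bucket] helper and the choice of hash function are all assumptions
| made for this example.

```python
import hashlib


def bucket(key, n_rows):
    """Map any string to one of n_rows buckets using a stable hash."""
    digest = hashlib.blake2b(key.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % n_rows


class HashedTable:
    """Fixed-size weight table indexed by hashed keys (illustrative only)."""

    def __init__(self, n_rows=1000):
        self.weights = [0.0] * n_rows

    def update(self, key, delta):
        # New keys reuse existing rows, so the table never grows.
        self.weights[bucket(key, len(self.weights))] += delta

    def get(self, key):
        return self.weights[bucket(key, len(self.weights))]


table = HashedTable(n_rows=1000)
size_before = len(table.weights)
for word in ["coffee", "beer", "a-brand-new-word"]:
    table.update(word, 0.5)
size_after = len(table.weights)
```

p
| The trade-off in a scheme like this is that two different keys can land
| in the same row, which is why real systems usually combine several hashes
| or use large tables.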
+h(2, "features") New features
p
| This section contains an overview of the most important
| #[strong new features and improvements]. The #[+a("/docs/api") API docs]
| include additional deprecation notes. New methods and functions that
| were introduced in this version are marked with a #[+tag-new(2)] tag.
+h(3, "features-pipelines") Improved processing pipelines
+aside-code("Example").
# Modify an existing pipeline
nlp = spacy.load('en')
nlp.pipeline.append(my_component)
# Register a factory to create a component
spacy.set_factory('my_factory', my_factory)
nlp = Language(pipeline=['my_factory', my_component])
p
| It's now much easier to #[strong customise the pipeline] with your own
| components: functions that receive a #[code Doc] object, modify it and
| return it. If your component is stateful, you can define and register a
| factory which receives the shared #[code Vocab] object and returns a
| component. spaCy's default components can be added to your pipeline by
| using their string IDs. This way, you won't have to worry about finding
| and implementing them: simply add #[code "tagger"] to the pipeline,
| and spaCy will know what to do.
+image
include ../../assets/img/docs/pipeline.svg
+infobox
| #[strong API:] #[+api("language") #[code Language]]
| #[strong Usage:] #[+a("/docs/usage/language-processing-pipeline") Processing text]
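p
| The component and factory pattern described above can be sketched without
| spaCy at all. The #[code Vocab] and #[code Doc] classes below are
| hypothetical stand-ins, and #[code FACTORIES], #[code build_pipeline] and
| #[code run] are names invented for this example, not part of the library.

```python
class Vocab:
    """Hypothetical stand-in for the shared vocabulary."""

    def __init__(self):
        self.strings = set()


class Doc:
    """Hypothetical stand-in for a document: a list of words plus a vocab."""

    def __init__(self, vocab, words):
        self.vocab = vocab
        self.words = words
        self.user_data = {}


def count_words(doc):
    # A stateless component: receive the doc, modify it, return it.
    doc.user_data["n_words"] = len(doc.words)
    return doc


def make_string_collector(vocab):
    # A factory: receives the shared vocab and returns a component.
    def collect_strings(doc):
        vocab.strings.update(doc.words)
        return doc

    return collect_strings


# Registry that lets components be referred to by string ID.
FACTORIES = {"string_collector": make_string_collector}


def build_pipeline(vocab, entries):
    # String IDs are resolved through the registry; callables are used as-is.
    return [FACTORIES[e](vocab) if isinstance(e, str) else e for e in entries]


def run(pipeline, doc):
    for component in pipeline:
        doc = component(doc)
    return doc


vocab = Vocab()
pipeline = build_pipeline(vocab, ["string_collector", count_words])
doc = run(pipeline, Doc(vocab, ["I", "love", "coffee"]))
```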
+h(3, "features-text-classification") Text classification
+aside-code("Example").
from spacy.lang.en import English
nlp = English(pipeline=['tensorizer', 'tagger', 'textcat'])
p
| spaCy v2.0 lets you add text categorization models to spaCy pipelines.
| The model supports classification with multiple, non-mutually exclusive
| labels, so more than one label can apply at once. You can change the model
| architecture rather easily, but by default, the #[code TextCategorizer]
| class uses a convolutional neural network to assign position-sensitive
| vectors to each word in the document.
+infobox
| #[strong API:] #[+api("textcategorizer") #[code TextCategorizer]],
| #[+api("doc#attributes") #[code Doc.cats]],
| #[+api("goldparse#attributes") #[code GoldParse.cats]]#[br]
| #[strong Usage:] #[+a("/docs/usage/text-classification") Text classification]
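p
| To make "non-mutually exclusive labels" concrete, the sketch below
| contrasts independent per-label scores with a softmax that forces the
| labels to compete. The labels and raw scores are made up, and scoring each
| label with its own sigmoid is an assumption about how such a model could
| work, not a description of the #[code TextCategorizer] internals.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["SPORTS", "POLITICS", "FINANCE"]
raw_scores = [2.0, 1.5, -3.0]  # made-up model outputs for one text

# Independent scores: each label is judged on its own, so several can be high.
independent = {label: sigmoid(s) for label, s in zip(labels, raw_scores)}

# Mutually exclusive scores: the labels share one probability mass.
exclusive = dict(zip(labels, softmax(raw_scores)))

# With independent scores, every label above a threshold is assigned.
assigned = [label for label, p in independent.items() if p >= 0.5]
```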
+h(3, "features-hash-ids") Hash values instead of integer IDs
+aside-code("Example").
doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
beer_hash = doc.vocab.strings.add(u'beer')
assert doc.vocab.strings[u'beer'] == beer_hash
assert doc.vocab.strings[beer_hash] == u'beer'
p
| The #[+api("stringstore") #[code StringStore]] now resolves all strings
| to hash values instead of integer IDs. This means that the string-to-int
| mapping #[strong no longer depends on the vocabulary state], making a lot
| of workflows much simpler, especially during training. Unlike integer IDs
| in spaCy v1.x, hash values will #[strong always match] even across
| models. Strings can now be added explicitly using the new
| #[+api("stringstore#add") #[code StringStore.add]] method. A token's hash
| is available via #[code token.orth].
+infobox
| #[strong API:] #[+api("stringstore") #[code StringStore]]
| #[strong Usage:] #[+a("/docs/usage/spacy-101#vocab") Vocab, hashes and lexemes 101]
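p
| The difference between state-dependent integer IDs and hash values can be
| shown with two toy stores. Both classes are hypothetical, and
| #[code hash_string] uses a standard-library hash as a stand-in, so the
| numbers it produces will not match the values spaCy returns.

```python
import hashlib


def hash_string(string):
    # Stand-in 64-bit hash; spaCy's own hash function is different.
    digest = hashlib.blake2b(string.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


class HashStore:
    """Toy v2-style store: the ID depends only on the string itself."""

    def __init__(self):
        self._by_hash = {}

    def add(self, string):
        key = hash_string(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return hash_string(item)
        return self._by_hash[item]  # KeyError if the string was never added


class CounterStore:
    """Toy v1.x-style store: the ID depends on insertion order."""

    def __init__(self):
        self._ids = {}

    def add(self, string):
        return self._ids.setdefault(string, len(self._ids))


# Two independent hash stores agree on the ID for "coffee".
a, b = HashStore(), HashStore()
a.add("beer")
coffee_a = a.add("coffee")
coffee_b = b.add("coffee")

# Two counter-based stores disagree once their histories differ.
old_a, old_b = CounterStore(), CounterStore()
old_a.add("beer")
old_coffee_a = old_a.add("coffee")
old_coffee_b = old_b.add("coffee")
```

p
| Looking up a hash only works in a store that has seen the string, which
| mirrors why strings need to be added explicitly before a hash can be
| resolved back to text.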
+h(3, "features-serializer") Saving, loading and serialization
+aside-code("Example").
nlp = spacy.load('en') # shortcut link
nlp = spacy.load('en_core_web_sm') # package
nlp = spacy.load('/path/to/en') # unicode path
nlp = spacy.load(Path('/path/to/en')) # pathlib Path
nlp.to_disk('/path/to/nlp')
nlp = English().from_disk('/path/to/nlp')
p
| spaCy's serialization API has been made consistent across classes and
| objects. All container classes, i.e. #[code Language], #[code Doc],
| #[code Vocab] and #[code StringStore], now have #[code to_bytes()],
| #[code from_bytes()], #[code to_disk()] and #[code from_disk()] methods
| that support the Pickle protocol.
p
| The improved #[code spacy.load] makes loading models easier and more
| transparent. You can load a model by supplying its
| #[+a("/docs/usage/models#usage") shortcut link], the name of an installed
| #[+a("/docs/usage/saving-loading#generating") model package] or a path.
| The #[code Language] class to initialise will be determined based on the
| model's settings. For a blank language, you can import the class directly,
| e.g. #[code from spacy.lang.en import English].
+infobox
| #[strong API:] #[+api("spacy#load") #[code spacy.load]], #[+api("binder") #[code Binder]]
| #[strong Usage:] #[+a("/docs/usage/saving-loading") Saving and loading]
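p
| As a rough illustration of the four-method convention, here is a
| hypothetical #[code Container] class that follows it. The JSON byte format
| and the file name are assumptions chosen to keep the example
| self-contained. spaCy's own containers use their own formats.

```python
import json
import tempfile
from pathlib import Path


class Container:
    """Hypothetical container following the to/from bytes/disk convention."""

    def __init__(self, data=None):
        self.data = dict(data or {})

    def to_bytes(self):
        return json.dumps(self.data, sort_keys=True).encode("utf8")

    def from_bytes(self, bytes_data):
        self.data = json.loads(bytes_data.decode("utf8"))
        return self  # returning self allows Container().from_bytes(...)

    def to_disk(self, path):
        Path(path).write_bytes(self.to_bytes())

    def from_disk(self, path):
        # Loading from disk reuses the byte format, so both paths stay in sync.
        return self.from_bytes(Path(path).read_bytes())


original = Container({"lang": "en"})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "container.json"
    original.to_disk(path)
    restored_from_disk = Container().from_disk(path)

restored_from_bytes = Container().from_bytes(original.to_bytes())
```

p
| Having the loading methods return the object itself is what makes the
| chained style #[code English().from_disk('/path/to/nlp')] possible.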
+h(3, "features-displacy") displaCy visualizer with Jupyter support
+aside-code("Example").
from spacy import displacy
doc = nlp(u'This is a sentence about Facebook.')
displacy.serve(doc, style='dep') # run the web server
html = displacy.render(doc, style='ent') # generate HTML
p
| Our popular dependency and named entity visualizers are now an official
| part of the spaCy library! displaCy can run a simple web server, or
| generate raw HTML markup or SVG files to be exported. You can pass in one
| or more docs, and customise the style. displaCy also auto-detects whether
| you're running #[+a("https://jupyter.org") Jupyter] and will render the
| visualizations in your notebook.
+infobox
| #[strong API:] #[+api("displacy") #[code displacy]]
| #[strong Usage:] #[+a("/docs/usage/visualizers") Visualizing spaCy]
+h(3, "features-language") Improved language data and lazy loading
p
| Language-specific data now lives in its own submodule, #[code spacy.lang].
| Languages are lazy-loaded, i.e. only loaded when you import a
| #[code Language] class, or load a model that initialises one. This allows
| languages to contain more custom data, e.g. lemmatizer lookup tables, or
| complex regular expressions. The language data has also been tidied up
| and simplified. spaCy now also supports simple lookup-based lemmatization.
+infobox
| #[strong API:] #[+api("language") #[code Language]]
| #[strong Code:] #[+src(gh("spaCy", "spacy/lang")) spacy/lang]
| #[strong Usage:] #[+a("/docs/usage/adding-languages") Adding languages]
+h(3, "features-matcher") Revised matcher API
+aside-code("Example").
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matcher.add('HEARTS', None, [{'ORTH': '❤️', 'OP': '+'}])
assert len(matcher) == 1
assert 'HEARTS' in matcher
p
| Patterns can now be added to the matcher by calling
| #[+api("matcher-add") #[code matcher.add()]] with a match ID, an optional
| callback function to be invoked on each match, and one or more patterns.
| This allows you to write powerful, pattern-specific logic using only one
| matcher. For example, you might only want to merge some entity types,
| and set custom flags for other matched patterns.
+infobox
| #[strong API:] #[+api("matcher") #[code Matcher]]
| #[strong Usage:] #[+a("/docs/usage/rule-based-matching") Rule-based matching]
+h(3, "features-models") Neural network models for English, German, French, Spanish and multi-language NER
+aside-code("Example", "bash").
python -m spacy download en # default English model
python -m spacy download de # default German model
python -m spacy download fr # default French model
python -m spacy download es # default Spanish model
python -m spacy download xx_ent_wiki_sm # multi-language NER
p
| spaCy v2.0 comes with new and improved neural network models for English,
| German, French and Spanish, as well as a multi-language named entity
| recognition model trained on Wikipedia. #[strong GPU usage] is now
| supported via #[+a("http://chainer.org") Chainer]'s CuPy module.
+infobox
| #[strong Details:] #[+a("/docs/api/language-models") Languages],
| #[+src(gh("spacy-models")) spacy-models]
| #[strong Usage:] #[+a("/docs/usage/models") Models],
| #[+a("/docs/usage#gpu") Using spaCy with GPU]
+h(2, "incompat") Backwards incompatibilities
+table(["Old", "New"])
+row
+cell
| #[code spacy.en]
| #[code spacy.xx]
+cell
| #[code spacy.lang.en]
| #[code spacy.lang.xx]
+row
+cell #[code orth]
+cell #[code lang.xx.lex_attrs]
+row
+cell #[code syntax.iterators]
+cell #[code lang.xx.syntax_iterators]
+row
+cell #[code Language.save_to_directory]
+cell #[+api("language#to_disk") #[code Language.to_disk]]
+row
+cell #[code Language.create_make_doc]
+cell #[+api("language#attributes") #[code Language.tokenizer]]
+row
+cell
| #[code Vocab.load]
| #[code Vocab.load_lexemes]
+cell
| #[+api("vocab#from_disk") #[code Vocab.from_disk]]
| #[+api("vocab#from_bytes") #[code Vocab.from_bytes]]
+row
+cell
| #[code Vocab.dump]
+cell
| #[+api("vocab#to_disk") #[code Vocab.to_disk]]#[br]
| #[+api("vocab#to_bytes") #[code Vocab.to_bytes]]
+row
+cell
| #[code Vocab.load_vectors]
| #[code Vocab.load_vectors_from_bin_loc]
+cell
| #[+api("vectors#from_disk") #[code Vectors.from_disk]]
| #[+api("vectors#from_bytes") #[code Vectors.from_bytes]]
+row
+cell
| #[code Vocab.dump_vectors]
+cell
| #[+api("vectors#to_disk") #[code Vectors.to_disk]]
| #[+api("vectors#to_bytes") #[code Vectors.to_bytes]]
+row
+cell
| #[code StringStore.load]
+cell
| #[+api("stringstore#from_disk") #[code StringStore.from_disk]]
| #[+api("stringstore#from_bytes") #[code StringStore.from_bytes]]
+row
+cell
| #[code StringStore.dump]
+cell
| #[+api("stringstore#to_disk") #[code StringStore.to_disk]]
| #[+api("stringstore#to_bytes") #[code StringStore.to_bytes]]
+row
+cell #[code Tokenizer.load]
+cell
| #[+api("tokenizer#from_disk") #[code Tokenizer.from_disk]]
| #[+api("tokenizer#from_bytes") #[code Tokenizer.from_bytes]]
+row
+cell #[code Tagger.load]
+cell
| #[+api("tagger#from_disk") #[code Tagger.from_disk]]
| #[+api("tagger#from_bytes") #[code Tagger.from_bytes]]
+row
+cell #[code DependencyParser.load]
+cell
| #[+api("dependencyparser#from_disk") #[code DependencyParser.from_disk]]
| #[+api("dependencyparser#from_bytes") #[code DependencyParser.from_bytes]]
+row
+cell #[code EntityRecognizer.load]
+cell
| #[+api("entityrecognizer#from_disk") #[code EntityRecognizer.from_disk]]
| #[+api("entityrecognizer#from_bytes") #[code EntityRecognizer.from_bytes]]
+row
+cell #[code Matcher.load]
+cell -
+row
+cell
| #[code Matcher.add_pattern]
| #[code Matcher.add_entity]
+cell #[+api("matcher#add") #[code Matcher.add]]
+row
+cell #[code Matcher.get_entity]
+cell #[+api("matcher#get") #[code Matcher.get]]
+row
+cell #[code Matcher.has_entity]
+cell #[+api("matcher#contains") #[code Matcher.__contains__]]
+row
+cell #[code Doc.read_bytes]
+cell #[+api("binder") #[code Binder]]
+row
+cell #[code Token.is_ancestor_of]
+cell #[+api("token#is_ancestor") #[code Token.is_ancestor]]
+row
+cell #[code cli.model]
+cell -
+h(2, "migrating") Migrating from spaCy v1.x
p
| Because we've made so many architectural changes to the library, we've
| tried to #[strong keep breaking changes to a minimum]. A lot of projects
| follow the philosophy that if you're going to break anything, you may as
| well break everything. We think migration is easier if there's a logic to
| what has changed.
p
| We've therefore followed a policy of avoiding breaking changes to the
| #[code Doc], #[code Span] and #[code Token] objects. This way, you can
| focus on only migrating the code that does training, loading and
| serialization — in other words, code that works with the #[code nlp]
| object directly. Code that uses the annotations should continue to work.
+infobox("Important note")
| If you've trained your own models, keep in mind that your train and
| runtime inputs must match. This means you'll have to
| #[strong retrain your models] with spaCy v2.0.
+h(3, "migrating-saving-loading") Saving, loading and serialization
p
| Double-check all calls to #[code spacy.load()] and make sure they don't
| use the #[code path] keyword argument. If you're only loading in binary
| data and not a model package that can construct its own #[code Language]
| class and pipeline, you should now use the
| #[+api("language#from_disk") #[code Language.from_disk()]] method.
+code-new.
nlp = spacy.load('/model')
nlp = English().from_disk('/model/data')
+code-old nlp = spacy.load('en', path='/model')
p
| Review all other code that writes state to disk or bytes.
| All containers now share the same, consistent API for saving and
| loading. Replace saving with #[code to_disk()] or #[code to_bytes()], and
| loading with #[code from_disk()] and #[code from_bytes()].
+code-new.
nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
+code-old.
nlp.save_to_directory('/model')
nlp.vocab.dump('/vocab')
p
| If you've trained models with input from v1.x, you'll need to
| #[strong retrain them] with spaCy v2.0. Models trained with previous
| versions are not compatible with the new version.
+h(3, "migrating-strings") Strings and hash values
p
| The change from integer IDs to hash values may not actually affect your
| code very much. However, if you're adding strings to the vocab manually,
| you now need to call #[+api("stringstore#add") #[code StringStore.add()]]
| explicitly. You can also now be sure that the string-to-hash mapping will
| always match across vocabularies.
+code-new.
nlp.vocab.strings.add(u'coffee')
nlp.vocab.strings[u'coffee'] # 3197928453018144401
other_nlp.vocab.strings[u'coffee'] # 3197928453018144401
+code-old.
nlp.vocab.strings[u'coffee'] # 3672
other_nlp.vocab.strings[u'coffee'] # 40259
+h(3, "migrating-languages") Processing pipelines and language data
p
| If you're importing language data or #[code Language] classes, make sure
| to change your import statements to import from #[code spacy.lang]. If
| you've added your own custom language, it needs to be moved to
| #[code spacy/lang/xx] and adjusted accordingly.
+code-new from spacy.lang.en import English
+code-old from spacy.en import English
p
| If you've been using custom pipeline components, check out the new
| guide on #[+a("/docs/usage/language-processing-pipelines") processing pipelines].
| Appending functions to the pipeline still works, but you might be able
| to make this more convenient by registering "component factories".
| Components of the processing pipeline can now be disabled by passing a
| list of their names to the #[code disable] keyword argument on loading
| or processing.
+code-new.
nlp = spacy.load('en', disable=['tagger', 'ner'])
doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old.
nlp = spacy.load('en', tagger=False, entity=False)
doc = nlp(u"I don't want parsed", parse=False)
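p
| Conceptually, disabling components by name amounts to skipping them while
| the pipeline runs. The sketch below shows that idea with a plain list of
| named callables. #[code run_pipeline] and #[code make_marker] are
| hypothetical helpers, and a dict stands in for the #[code Doc].

```python
def run_pipeline(pipeline, doc, disable=()):
    """Apply each (name, component) pair in order, skipping disabled names."""
    for name, component in pipeline:
        if name in disable:
            continue
        doc = component(doc)
    return doc


def make_marker(name):
    # Hypothetical component: records on the doc (a plain dict) that it ran.
    def component(doc):
        doc.setdefault("ran", []).append(name)
        return doc

    return component


pipeline = [(name, make_marker(name)) for name in ("tagger", "parser", "ner")]

full = run_pipeline(pipeline, {})
no_parser = run_pipeline(pipeline, {}, disable=["parser"])
```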
+h(3, "migrating-matcher") Adding patterns and callbacks to the matcher
p
| If you're using the matcher, you can now add patterns in one step. This
| should be easy to update: simply merge the ID, callback and patterns
| into one call to #[+api("matcher#add") #[code Matcher.add()]].
+code-new.
matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-old.
matcher.add_entity('GoogleNow', on_match=merge_phrases)
matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])
p
| If you've been using #[strong acceptor functions], you'll need to move
| this logic into the
| #[+a("/docs/usage/rule-based-matching#on_match") #[code on_match] callbacks].
| The callback function is invoked on every match and will give you access to
| the doc, the index of the current match and the list of all matches. This
| lets you accept or reject the match, and define the actions to be
| triggered.
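p
| The sketch below shows how former acceptor logic might move into an
| #[code on_match] callback. #[code MiniMatcher] is a toy stand-in that only
| compares token text and works on a list of strings. It is not the real
| #[code Matcher], and the acceptance rule is invented for illustration. The
| callback is assumed to receive the matcher, the doc, the index of the
| current match and the list of all matches.

```python
class MiniMatcher:
    """Toy stand-in for a token matcher: exact matches on token text only."""

    def __init__(self):
        self._patterns = {}

    def add(self, key, on_match, *patterns):
        # One call registers the ID, the callback and the patterns together.
        self._patterns[key] = (on_match, patterns)

    def __call__(self, doc):
        matches = []
        for key, (_, patterns) in self._patterns.items():
            for pattern in patterns:
                words = [spec["ORTH"] for spec in pattern]
                for start in range(len(doc) - len(words) + 1):
                    if doc[start:start + len(words)] == words:
                        matches.append((key, start, start + len(words)))
        # Invoke each key's callback once per match, after collecting them all.
        for i, (key, _, _) in enumerate(matches):
            on_match = self._patterns[key][0]
            if on_match is not None:
                on_match(self, doc, i, matches)
        return matches


accepted = []


def accept_if_initial(matcher, doc, i, matches):
    # Former acceptor logic: only act on matches at the start of the doc.
    key, start, end = matches[i]
    if start == 0:
        accepted.append(doc[start:end])


matcher = MiniMatcher()
matcher.add("GoogleNow", accept_if_initial, [{"ORTH": "Google"}, {"ORTH": "Now"}])
doc = ["Google", "Now", "is", "not", "Google", "Now", "anymore"]
matches = matcher(doc)
```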
+h(2, "benchmarks") Benchmarks
+under-construction
+aside("Data sources")
| #[strong Parser, tagger, NER:] #[+a("https://www.gabormelli.com/RKB/OntoNotes_Corpus") OntoNotes 5]#[br]
| #[strong Word vectors:] #[+a("http://commoncrawl.org") Common Crawl]#[br]
p The evaluation was conducted on raw text with no gold standard information.
+table(["Model", "Version", "Type", "UAS", "LAS", "NER F", "POS", "w/s"])
mixin benchmark-row(name, details, values, highlight, style)
+row(style)
+cell #[code=name]
for cell in details
+cell=cell
for cell, i in values
+cell.u-text-right
if highlight && highlight[i]
strong=cell
else
!=cell
+benchmark-row("en_core_web_sm", ["2.0.0", "neural"], ["91.2", "89.2", "82.6", "96.6", "10,300"], [1, 1, 1, 0, 0])
+benchmark-row("en_core_web_sm", ["1.2.0", "linear"], ["86.6", "83.8", "78.5", "96.6", "25,700"], [0, 0, 0, 0, 1], "divider")
+benchmark-row("en_core_web_md", ["1.2.1", "linear"], ["90.6", "88.5", "81.4", "96.7", "18,800"], [0, 0, 0, 1, 0])