diff --git a/website/docs/usage/v2-2.md b/website/docs/usage/v2-2.md new file mode 100644 index 000000000..2109ae812 --- /dev/null +++ b/website/docs/usage/v2-2.md @@ -0,0 +1,292 @@ +--- +title: What's New in v2.2 +teaser: New features, backwards incompatibilities and migration guide +menu: + - ['New Features', 'features'] + - ['Backwards Incompatibilities', 'incompat'] +--- + +## New Features {#features hidden="true"} + + + +### Better pretrained models and more languages {#models} + +> #### Example +> +> ```python +> python -m spacy download nl_core_news_sm +> python -m spacy download nb_core_news_sm +> python -m spacy download nb_core_news_md +> python -m spacy download lt_core_news_sm +> ``` + +The new version also features new and re-trained models for all languages and +resolves a number of data bugs. The [Dutch model](/models/nl) has been retrained +with a new and custom-labelled NER corpus using the same extended label scheme +as the English models. It should now produce significantly better NER results +overall. We've also added new core models for [Norwegian](/models/nb) (MIT) and +[Lithuanian](/models/lt) (CC BY-SA). + + + +**Usage:** [Models directory](/models) **Benchmarks: ** +[Release notes](https://github.com/explosion/spaCy/releases/tag/v2.2.0) + + + +### Entity linking API {#entity-linking} + +> #### Example +> +> ```python +> nlp = spacy.load("my_custom_wikidata_model") +> doc = nlp("Ada Lovelace was born in London") +> print([(e.text, e.label_, e.kb_id_) for e in doc.ents]) +> # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')] +> ``` + +Entity linking lets you ground named entities into the "real world". We're +excited to now provide a built-in API for training entity linking models and +resolving textual entities to unique identifiers from a knowledge base. The +annotated KB identifier is accessible as either a hash value or as a string from +a `Span` or `Token` object. For more details on entity linking in spaCy, check +out +[Sofie's talk](https://www.youtube.com/watch?v=PW3RJM8tDGo&list=PLBmcuObd5An4UC6jvK_-eSl6jCvP1gwXc&index=6) +at spaCy IRL 2019. + + + +**API:** [`EntityLinker`](/api/entitylinker), +[`KnowledgeBase`](/api/knowledgebase) **Code: ** +[`bin/wiki_entity_linking`](https://github.com/explosion/spaCy/tree/master/bin/wiki_entity_linking) +**Usage: ** [Entity linking](/usage/linguistic-features#entity-linking), +[Training an entity linking model](/usage/training#entity-linker) + + + +### Serializable lookup table and dictionary API {#lookups} + +> #### Example +> +> ```python +> data = {"foo": "bar"} +> nlp.vocab.lookups.add_table("my_dict", data) +> +> def custom_component(doc): +> table = doc.vocab.lookups.get_table("my_dict") +> print(table.get("foo")) # look something up +> return doc +> ``` + +The new `Lookups` API lets you add large dictionaries and lookup tables to the +`Vocab` and access them from the tokenizer or custom components and extension +attributes. Internally, the tables use Bloom filters for efficient lookup +checks. They're also fully serializable out-of-the-box. All large data resources +included with spaCy now use this API and are additionally compressed at build +time. This allowed us to make the installed library roughly **15 times smaller +on disk**. + + + +**API:** [`Lookups`](/api/lookups) **Usage: ** +[Adding languages: Lemmatizer](/usage/adding-languages#lemmatizer) + + + +### Text classification scores and CLI training {#train-textcat-cli} + +> #### Example +> +> ```python +> scorer = nlp.evaluate(dev_data) +> print(scorer.textcat_scores, scorer.textcats_per_cat) +> ``` + +When training your models using the `spacy train` command, you can now also +include text categories in the JSON-formatted training data. The `Scorer` and +`nlp.evaluate` now report the text classification scores, calculated as the +F-score on positive label for binary exclusive tasks, the macro-averaged F-score +for 3+ exclusive labels or the macro-averaged AUC ROC score for multilabel +classification. + + + +**API:** [`spacy train`](/api/cli#train), [`Scorer`](/api/scorer), +[`Language.evaluate`](/api/language#evaluate) + + + +### CLI command to debug and validate training data {#debug-data} + +> #### Example +> +> ```bash +> $ python -m spacy debug-data en train.json dev.json +> ``` + +The new `debug-data` command lets you analyze and validate your training and +development data, get useful stats, and find problems like invalid entity +annotations, cyclic dependencies, low data labels and more. If you're training a +model with `spacy train` and the results seem surprising or confusing, +`debug-data` may help you track down the problems and improve your training +data. + + + +``` +=========================== Data format validation =========================== +✔ Corpus is loadable + +=============================== Training stats =============================== +Training pipeline: tagger, parser, ner +Starting with blank model 'en' +18127 training docs +2939 evaluation docs +⚠ 34 training examples also in evaluation data + +============================== Vocab & Vectors ============================== +ℹ 2083156 total words in the data (56962 unique) +⚠ 13020 misaligned tokens in the training data +⚠ 2423 misaligned tokens in the dev data +10 most common words: 'the' (98429), ',' (91756), '.' (87073), 'to' (50058), +'of' (49559), 'and' (44416), 'a' (34010), 'in' (31424), 'that' (22792), 'is' +(18952) +ℹ No word vectors present in the model + +========================== Named Entity Recognition ========================== +ℹ 18 new labels, 0 existing labels +528978 missing values (tokens with '-' label) +New: 'ORG' (23860), 'PERSON' (21395), 'GPE' (21193), 'DATE' (18080), 'CARDINAL' +(10490), 'NORP' (9033), 'MONEY' (5164), 'PERCENT' (3761), 'ORDINAL' (2122), +'LOC' (2113), 'TIME' (1616), 'WORK_OF_ART' (1229), 'QUANTITY' (1150), 'FAC' +(1134), 'EVENT' (974), 'PRODUCT' (935), 'LAW' (444), 'LANGUAGE' (338) +✔ Good amount of examples for all labels +✔ Examples without occurences available for all labels +✔ No entities consisting of or starting/ending with whitespace + +=========================== Part-of-speech Tagging =========================== +ℹ 49 labels in data (57 labels in tag map) +'NN' (266331), 'IN' (227365), 'DT' (185600), 'NNP' (164404), 'JJ' (119830), +'NNS' (110957), '.' (101482), ',' (92476), 'RB' (90090), 'PRP' (90081), 'VB' +(74538), 'VBD' (68199), 'CC' (62862), 'VBZ' (50712), 'VBP' (43420), 'VBN' +(42193), 'CD' (40326), 'VBG' (34764), 'TO' (31085), 'MD' (25863), 'PRP$' +(23335), 'HYPH' (13833), 'POS' (13427), 'UH' (13322), 'WP' (10423), 'WDT' +(9850), 'RP' (8230), 'WRB' (8201), ':' (8168), '''' (7392), '``' (6984), 'NNPS' +(5817), 'JJR' (5689), '$' (3710), 'EX' (3465), 'JJS' (3118), 'RBR' (2872), +'-RRB-' (2825), '-LRB-' (2788), 'PDT' (2078), 'XX' (1316), 'RBS' (1142), 'FW' +(794), 'NFP' (557), 'SYM' (440), 'WP$' (294), 'LS' (293), 'ADD' (191), 'AFX' +(24) +✔ All labels present in tag map for language 'en' + +============================= Dependency Parsing ============================= +ℹ Found 111703 sentences with an average length of 18.6 words. +ℹ Found 2251 nonprojective train sentences +ℹ Found 303 nonprojective dev sentences +ℹ 47 labels in train data +ℹ 211 labels in projectivized train data +'punct' (236796), 'prep' (188853), 'pobj' (182533), 'det' (172674), 'nsubj' +(169481), 'compound' (116142), 'ROOT' (111697), 'amod' (107945), 'dobj' (93540), +'aux' (86802), 'advmod' (86197), 'cc' (62679), 'conj' (59575), 'poss' (36449), +'ccomp' (36343), 'advcl' (29017), 'mark' (27990), 'nummod' (24582), 'relcl' +(21359), 'xcomp' (21081), 'attr' (18347), 'npadvmod' (17740), 'acomp' (17204), +'auxpass' (15639), 'appos' (15368), 'neg' (15266), 'nsubjpass' (13922), 'case' +(13408), 'acl' (12574), 'pcomp' (10340), 'nmod' (9736), 'intj' (9285), 'prt' +(8196), 'quantmod' (7403), 'dep' (4300), 'dative' (4091), 'agent' (3908), 'expl' +(3456), 'parataxis' (3099), 'oprd' (2326), 'predet' (1946), 'csubj' (1494), +'subtok' (1147), 'preconj' (692), 'meta' (469), 'csubjpass' (64), 'iobj' (1) +⚠ Low number of examples for label 'iobj' (1) +⚠ Low number of examples for 130 labels in the projectivized dependency +trees used for training. You may want to projectivize labels such as punct +before training in order to improve parser performance. +⚠ Projectivized labels with low numbers of examples: appos||attr: 12 +advmod||dobj: 13 prep||ccomp: 12 nsubjpass||ccomp: 15 pcomp||prep: 14 +amod||dobj: 9 attr||xcomp: 14 nmod||nsubj: 17 prep||advcl: 2 prep||prep: 5 +nsubj||conj: 12 advcl||advmod: 18 ccomp||advmod: 11 ccomp||pcomp: 5 acl||pobj: +10 npadvmod||acomp: 7 dobj||pcomp: 14 nsubjpass||pcomp: 1 nmod||pobj: 8 +amod||attr: 6 nmod||dobj: 12 aux||conj: 1 neg||conj: 1 dative||xcomp: 11 +pobj||dative: 3 xcomp||acomp: 19 advcl||pobj: 2 nsubj||advcl: 2 csubj||ccomp: 1 +advcl||acl: 1 relcl||nmod: 2 dobj||advcl: 10 advmod||advcl: 3 nmod||nsubjpass: 6 +amod||pobj: 5 cc||neg: 1 attr||ccomp: 16 advcl||xcomp: 3 nmod||attr: 4 +advcl||nsubjpass: 5 advcl||ccomp: 4 ccomp||conj: 1 punct||acl: 1 meta||acl: 1 +parataxis||acl: 1 prep||acl: 1 amod||nsubj: 7 ccomp||ccomp: 3 acomp||xcomp: 5 +dobj||acl: 5 prep||oprd: 6 advmod||acl: 2 dative||advcl: 1 pobj||agent: 5 +xcomp||amod: 1 dep||advcl: 1 prep||amod: 8 relcl||compound: 1 advcl||csubj: 3 +npadvmod||conj: 2 npadvmod||xcomp: 4 advmod||nsubj: 3 ccomp||amod: 7 +advcl||conj: 1 nmod||conj: 2 advmod||nsubjpass: 2 dep||xcomp: 2 appos||ccomp: 1 +advmod||dep: 1 advmod||advmod: 5 aux||xcomp: 8 dep||advmod: 1 dative||ccomp: 2 +prep||dep: 1 conj||conj: 1 dep||ccomp: 4 cc||ROOT: 1 prep||ROOT: 1 nsubj||pcomp: +3 advmod||prep: 2 relcl||dative: 1 acl||conj: 1 advcl||attr: 4 prep||npadvmod: 1 +nsubjpass||xcomp: 1 neg||advmod: 1 xcomp||oprd: 1 advcl||advcl: 1 dobj||dep: 3 +nsubjpass||parataxis: 1 attr||pcomp: 1 ccomp||parataxis: 1 advmod||attr: 1 +nmod||oprd: 1 appos||nmod: 2 advmod||relcl: 1 appos||npadvmod: 1 appos||conj: 1 +prep||expl: 1 nsubjpass||conj: 1 punct||pobj: 1 cc||pobj: 1 conj||pobj: 1 +punct||conj: 1 ccomp||dep: 1 oprd||xcomp: 3 ccomp||xcomp: 1 ccomp||nsubj: 1 +nmod||dep: 1 xcomp||ccomp: 1 acomp||advcl: 1 intj||advmod: 1 advmod||acomp: 2 +relcl||oprd: 1 advmod||prt: 1 advmod||pobj: 1 appos||nummod: 1 relcl||npadvmod: +3 mark||advcl: 1 aux||ccomp: 1 amod||nsubjpass: 1 npadvmod||advmod: 1 conj||dep: +1 nummod||pobj: 1 amod||npadvmod: 1 intj||pobj: 1 nummod||npadvmod: 1 +xcomp||xcomp: 1 aux||dep: 1 advcl||relcl: 1 +⚠ The following labels were found only in the train data: xcomp||amod, +advcl||relcl, prep||nsubjpass, acl||nsubj, nsubjpass||conj, xcomp||oprd, +advmod||conj, advmod||advmod, iobj, advmod||nsubjpass, dobj||conj, ccomp||amod, +meta||acl, xcomp||xcomp, prep||attr, prep||ccomp, advcl||acomp, acl||dobj, +advcl||advcl, pobj||agent, prep||advcl, nsubjpass||xcomp, prep||dep, +acomp||xcomp, aux||ccomp, ccomp||dep, conj||dep, relcl||compound, +nsubjpass||ccomp, nmod||dobj, advmod||advcl, advmod||acl, dobj||advcl, +dative||xcomp, prep||nsubj, ccomp||ccomp, nsubj||ccomp, xcomp||acomp, +prep||acomp, dep||advmod, acl||pobj, appos||dobj, npadvmod||acomp, cc||ROOT, +relcl||nsubj, nmod||pobj, acl||nsubjpass, ccomp||advmod, pcomp||prep, +amod||dobj, advmod||attr, advcl||csubj, appos||attr, dobj||pcomp, prep||ROOT, +relcl||pobj, advmod||pobj, amod||nsubj, ccomp||xcomp, prep||oprd, +npadvmod||advmod, appos||nummod, advcl||pobj, neg||advmod, acl||attr, +appos||nsubjpass, csubj||ccomp, amod||nsubjpass, intj||pobj, dep||advcl, +cc||neg, xcomp||ccomp, dative||ccomp, nmod||oprd, pobj||dative, prep||dobj, +dep||ccomp, relcl||attr, ccomp||nsubj, advcl||xcomp, nmod||dep, advcl||advmod, +ccomp||conj, pobj||prep, advmod||acomp, advmod||relcl, attr||pcomp, +ccomp||parataxis, oprd||xcomp, intj||advmod, nmod||nsubjpass, prep||npadvmod, +parataxis||acl, prep||pobj, advcl||dobj, amod||pobj, prep||acl, conj||pobj, +advmod||dep, punct||pobj, ccomp||acomp, acomp||advcl, nummod||npadvmod, +dobj||dep, npadvmod||xcomp, advcl||conj, relcl||npadvmod, punct||acl, +relcl||dobj, dobj||xcomp, nsubjpass||parataxis, dative||advcl, relcl||nmod, +advcl||ccomp, appos||npadvmod, ccomp||pcomp, prep||amod, mark||advcl, +prep||advmod, prep||xcomp, appos||nsubj, attr||ccomp, advmod||prt, dobj||ccomp, +aux||conj, advcl||nsubj, conj||conj, advmod||ccomp, advcl||nsubjpass, +attr||xcomp, nmod||conj, npadvmod||conj, relcl||dative, prep||expl, +nsubjpass||pcomp, advmod||xcomp, advmod||dobj, appos||pobj, nsubj||conj, +relcl||nsubjpass, advcl||attr, appos||ccomp, advmod||prep, prep||conj, +nmod||attr, punct||conj, neg||conj, dep||xcomp, aux||xcomp, dobj||acl, +nummod||pobj, amod||npadvmod, nsubj||pcomp, advcl||acl, appos||nmod, +relcl||oprd, prep||prep, cc||pobj, nmod||nsubj, amod||attr, aux||dep, +appos||conj, advmod||nsubj, nsubj||advcl, acl||conj +To train a parser, your data should include at least 20 instances of each label. +⚠ Multiple root labels (ROOT, nsubj, aux, npadvmod, prep) found in +training data. spaCy's parser uses a single root label ROOT so this distinction +will not be available. + +================================== Summary ================================== +✔ 5 checks passed +⚠ 8 warnings +``` + + + + + +**API:** [`spacy debug-data`](/api/cli#debug-data) + + + +## Backwards incompatibilities {#incompat} + + + +If you've been training **your own models**, you'll need to **retrain** them +with the new version. Also don't forget to upgrade all models to the latest +versions. Models for v2.0 or v2.1 aren't compatible with models for v2.2. To +check if all of your models are up to date, you can run the +[`spacy validate`](/api/cli#validate) command. + + + + diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json index a05440e5a..7c6affe70 100644 --- a/website/meta/sidebars.json +++ b/website/meta/sidebars.json @@ -9,6 +9,7 @@ { "text": "Models & Languages", "url": "/usage/models" }, { "text": "Facts & Figures", "url": "/usage/facts-figures" }, { "text": "spaCy 101", "url": "/usage/spacy-101" }, + { "text": "New in v2.2", "url": "/usage/v2-2" }, { "text": "New in v2.1", "url": "/usage/v2-1" }, { "text": "New in v2.0", "url": "/usage/v2" } ]