Mirror of https://github.com/explosion/spaCy.git, synced 2024-12-26 09:56:28 +03:00

Merge remote-tracking branch 'remotes/upstream/master'
Commit c0afcd22bb

.github/contributors/luvogels.md (new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI UG (haftungsbeschränkt)](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry                |
|--------------------------------|----------------------|
| Name                           | Leif Uwe Vogelsang   |
| Company name (if applicable)   |                      |
| Title or role (if applicable)  |                      |
| Date                           | 4/27/2017            |
| GitHub username                | luvogels             |
| Website (optional)             |                      |

@@ -40,7 +40,7 @@ To distinguish issues that are opened by us, the maintainers, we usually add a
 | [`performance`](https://github.com/explosion/spaCy/labels/performance) | Accuracy, speed and memory use problems |
 | [`tests`](https://github.com/explosion/spaCy/labels/tests) | Missing or incorrect [tests](spacy/tests) |
 | [`docs`](https://github.com/explosion/spaCy/labels/docs), [`examples`](https://github.com/explosion/spaCy/labels/examples) | Issues related to the [documentation](https://spacy.io/docs) and [examples](spacy/examples) |
-| [`models`](https://github.com/explosion/spaCy/labels/models), [`english`](https://github.com/explosion/spaCy/labels/english), [`german`](https://github.com/explosion/spaCy/labels/german) | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
+| [`models`](https://github.com/explosion/spaCy/labels/models), `language / [name]` | Issues related to the specific [models](https://github.com/explosion/spacy-models), languages and data |
 | [`linux`](https://github.com/explosion/spaCy/labels/linux), [`osx`](https://github.com/explosion/spaCy/labels/osx), [`windows`](https://github.com/explosion/spaCy/labels/windows) | Issues related to the specific operating systems |
 | [`pip`](https://github.com/explosion/spaCy/labels/pip), [`conda`](https://github.com/explosion/spaCy/labels/conda) | Issues related to the specific package managers |
 | [`duplicate`](https://github.com/explosion/spaCy/labels/duplicate) | Duplicates, i.e. issues that have been reported before |

@@ -29,6 +29,7 @@ This is a list of everyone who has made significant contributions to spaCy, in alphabetical order.
 * Juan Miguel Cejuela, [@juanmirocks](https://github.com/juanmirocks)
 * Kendrick Tan, [@kendricktan](https://github.com/kendricktan)
 * Kyle P. Johnson, [@kylepjohnson](https://github.com/kylepjohnson)
+* Leif Uwe Vogelsang, [@luvogels](https://github.com/luvogels)
 * Liling Tan, [@alvations](https://github.com/alvations)
 * Magnus Burton, [@magnusburton](https://github.com/magnusburton)
 * Mark Amery, [@ExplodingCabbage](https://github.com/ExplodingCabbage)

README.rst (10 changes)

@@ -4,9 +4,9 @@ spaCy: Industrial-strength NLP
 spaCy is a library for advanced natural language processing in Python and
 Cython. spaCy is built on the very latest research, but it isn't researchware.
 It was designed from day one to be used in real products. spaCy currently supports
-English and German, as well as tokenization for Chinese, Spanish, Italian, French,
-Portuguese, Dutch, Swedish, Finnish, Hungarian, Bengali and Hebrew. It's commercial
-open-source software, released under the MIT license.
+English, German and French, as well as tokenization for Chinese, Spanish, Italian,
+Portuguese, Dutch, Swedish, Finnish, Norwegian, Hungarian, Bengali and Hebrew. It's
+commercial open-source software, released under the MIT license.

 📊 **Help us improve the library!** `Take the spaCy user survey <https://survey.spacy.io>`_.

@@ -320,6 +320,8 @@ and ``--model`` are optional and enable additional tests:
 =========== ============== ===========
 Version     Date           Description
 =========== ============== ===========
+`v1.8.2`_   ``2017-04-26`` French model and small improvements
+`v1.8.1`_   ``2017-04-23`` Saving, loading and training bug fixes
 `v1.8.0`_   ``2017-04-16`` Better NER training, saving and loading
 `v1.7.5`_   ``2017-04-07`` Bug fixes and new CLI commands
 `v1.7.3`_   ``2017-03-26`` Alpha support for Hebrew, new CLI commands and bug fixes

@@ -351,6 +353,8 @@ Version Date Description
 `v0.93`_    ``2015-09-22`` Bug fixes to word vectors
 =========== ============== ===========

+.. _v1.8.2: https://github.com/explosion/spaCy/releases/tag/v1.8.2
+.. _v1.8.1: https://github.com/explosion/spaCy/releases/tag/v1.8.1
 .. _v1.8.0: https://github.com/explosion/spaCy/releases/tag/v1.8.0
 .. _v1.7.5: https://github.com/explosion/spaCy/releases/tag/v1.7.5
 .. _v1.7.3: https://github.com/explosion/spaCy/releases/tag/v1.7.3

@@ -1,4 +1,5 @@
 #!/usr/bin/env python
+# coding: utf8
 """
 Example of training an additional entity type

@@ -26,11 +27,11 @@ For more details, see the documentation:
 Developed for: spaCy 1.7.6
 Last tested for: spaCy 1.7.6
 """
-# coding: utf8
 from __future__ import unicode_literals, print_function

 import random
 from pathlib import Path
-import random

 import spacy
 from spacy.gold import GoldParse
@@ -43,14 +44,35 @@ def train_ner(nlp, train_data, output_dir):
         doc = nlp.make_doc(raw_text)
         for word in doc:
             _ = nlp.vocab[word.orth]
-    for itn in range(20):
+    random.seed(0)
+    # You may need to change the learning rate. It's generally difficult to
+    # guess what rate you should set, especially when you have limited data.
+    nlp.entity.model.learn_rate = 0.001
+    for itn in range(1000):
         random.shuffle(train_data)
+        loss = 0.
         for raw_text, entity_offsets in train_data:
             gold = GoldParse(doc, entities=entity_offsets)
+            # By default, the GoldParse class assumes that the entities
+            # described by offset are complete, and all other words should
+            # have the tag 'O'. You can tell it to make no assumptions
+            # about the tag of a word by giving it the tag '-'.
+            # However, this allows a trivial solution to the current
+            # learning problem: if words are either 'any tag' or 'ANIMAL',
+            # the model can learn that all words can be tagged 'ANIMAL'.
+            #for i in range(len(gold.ner)):
+            #    if not gold.ner[i].endswith('ANIMAL'):
+            #        gold.ner[i] = '-'
             doc = nlp.make_doc(raw_text)
             nlp.tagger(doc)
-            loss = nlp.entity.update(doc, gold)
+            # As of 1.9, spaCy's parser now lets you supply a dropout probability
+            # This might help the model generalize better from only a few
+            # examples.
+            loss += nlp.entity.update(doc, gold, drop=0.9)
+        if loss == 0:
+            break
+    # This step averages the model's weights. This may or may not be good for
+    # your situation --- it's empirical.
     nlp.end_training()
     if output_dir:
         if not output_dir.exists():
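
The commented-out loop above is the "make no assumptions" variant the new comments describe: every gold tag outside an ANIMAL span is rewritten to `'-'`, which `GoldParse` treats as "any tag is acceptable". A standalone sketch of that rewrite on a hypothetical BILUO tag list (no spaCy required):

```python
# Hypothetical BILUO tags for the tokens of "Do you like horses ?"
gold_ner = ['O', 'O', 'O', 'U-ANIMAL', 'O']

# Replace every non-ANIMAL tag with '-', i.e. "no assumption made":
relaxed = [tag if tag.endswith('ANIMAL') else '-' for tag in gold_ner]
print(relaxed)  # ['-', '-', '-', 'U-ANIMAL', '-']
# Caveat from the comment above: once only '-' and ANIMAL tags remain,
# tagging *every* token ANIMAL becomes a zero-loss (trivial) solution.
```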

@@ -80,13 +102,19 @@ def main(model_name, output_directory=None):
         (
             "they pretend to care about your feelings, those horses",
             [(48, 54, 'ANIMAL')]
+        ),
+        (
+            "horses?",
+            [(0, 6, 'ANIMAL')]
         )
     ]
     nlp.entity.add_label('ANIMAL')
     train_ner(nlp, train_data, output_directory)

     # Test that the entity is recognized
     doc = nlp('Do you like horses?')
+    print("Ents in 'Do you like horses?':")
     for ent in doc.ents:
         print(ent.label_, ent.text)
     if output_directory:

setup.py (1 change; mode changed: Normal file → Executable file)

@@ -36,6 +36,7 @@ PACKAGES = [
     'spacy.fi',
     'spacy.bn',
     'spacy.he',
+    'spacy.nb',
     'spacy.en.lemmatizer',
     'spacy.cli.converters',
     'spacy.language_data',

@@ -3,14 +3,14 @@ from __future__ import unicode_literals

 from . import util
 from .deprecated import resolve_model_name
-from .cli import info
+from .cli.info import info

-from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he
+from . import en, de, zh, es, it, hu, fr, pt, nl, sv, fi, bn, he, nb


 _languages = (en.English, de.German, es.Spanish, pt.Portuguese, fr.French,
               it.Italian, hu.Hungarian, zh.Chinese, nl.Dutch, sv.Swedish,
-              fi.Finnish, bn.Bengali, he.Hebrew)
+              fi.Finnish, bn.Bengali, he.Hebrew, nb.Norwegian)


 for _lang in _languages:
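
Adding `nb` to the import line and to `_languages` is what actually registers the new language: the loop below the tuple maps each class's ISO code to the class in spaCy's language registry. A minimal sketch of the lookup this enables, assuming v1.x's `util.get_lang_class` helper:

```python
from spacy import util

# Hypothetical lookup after registration: 'nb' now resolves to the class
# defined in spacy/nb/__init__.py (added later in this commit).
Norwegian = util.get_lang_class('nb')
print(Norwegian.lang)  # 'nb'
```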

@@ -3,7 +3,7 @@
 # https://github.com/pypa/warehouse/blob/master/warehouse/__about__.py

 __title__ = 'spacy'
-__version__ = '1.8.0'
+__version__ = '1.8.2'
 __summary__ = 'Industrial-strength Natural Language Processing (NLP) with Python and Cython'
 __uri__ = 'https://spacy.io'
 __author__ = 'Matthew Honnibal'

@@ -13,4 +13,4 @@ __license__ = 'MIT'
 __docs__ = 'https://spacy.io/docs/usage'
 __download_url__ = 'https://github.com/explosion/spacy-models/releases/download'
 __compatibility__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json'
-__shortcuts__ = {'en': 'en_core_web_sm', 'de': 'de_core_news_md', 'vectors': 'en_vectors_glove_md'}
+__shortcuts__ = 'https://raw.githubusercontent.com/explosion/spacy-models/master/shortcuts.json'

@@ -17,29 +17,37 @@ def download(model=None, direct=False):
     if direct:
         download_model('{m}/{m}.tar.gz'.format(m=model))
     else:
-        model_name = about.__shortcuts__[model] if model in about.__shortcuts__ else model
+        model_name = check_shortcut(model)
         compatibility = get_compatibility()
         version = get_version(model_name, compatibility)
         download_model('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, v=version))
         link_package(model_name, model, force=True)


-def get_compatibility():
-    version = about.__version__
-    r = requests.get(about.__compatibility__)
+def get_json(url, desc):
+    r = requests.get(url)
     if r.status_code != 200:
         util.sys_exit(
-            "Couldn't fetch compatibility table. Please find the right model for "
-            "your spaCy installation (v{v}), and download it manually:".format(v=version),
+            "Couldn't fetch {d}. Please find the right model for your spaCy "
+            "installation (v{v}), and download it manually:".format(d=desc, v=about.__version__),
             "python -m spacy.download [full model name + version] --direct",
             title="Server error ({c})".format(c=r.status_code))
+    return r.json()

-    comp = r.json()['spacy']
+
+def check_shortcut(model):
+    shortcuts = get_json(about.__shortcuts__, "available shortcuts")
+    return shortcuts.get(model, model)
+
+
+def get_compatibility():
+    version = about.__version__
+    comp_table = get_json(about.__compatibility__, "compatibility table")
+    comp = comp_table['spacy']
     if version not in comp:
         util.sys_exit(
             "No compatible models found for v{v} of spaCy.".format(v=version),
             title="Compatibility error")
-    else:
-        return comp[version]
+    return comp[version]

@@ -47,7 +55,7 @@ def get_version(model, comp):
     if model not in comp:
         util.sys_exit(
             "No compatible model found for "
-            "{m} (spaCy v{v}).".format(m=model, v=about.__version__),
+            "'{m}' (spaCy v{v}).".format(m=model, v=about.__version__),
             title="Compatibility error")
     return comp[model][0]
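
The refactor routes both remote lookups through the single `get_json` helper. A rough walk-through of the resulting download flow, with made-up JSON contents shaped the way the code above indexes them:

```python
# Hypothetical trace of `python -m spacy download en` after this change:
# 1. check_shortcut('en') fetches shortcuts.json, e.g. {"en": "en_core_web_sm"},
#    and returns 'en_core_web_sm' (or the input unchanged if it's no shortcut).
# 2. get_compatibility() fetches compatibility.json, e.g.
#    {"spacy": {"1.8.2": {"en_core_web_sm": ["1.2.0", "1.1.0"]}}},
#    and returns the table for the running spaCy version.
# 3. get_version() picks the first (newest) entry for the model.
model_name = 'en_core_web_sm'
comp = {'en_core_web_sm': ['1.2.0', '1.1.0']}   # example data only
version = comp[model_name][0]                    # -> '1.2.0'
print('{m}-{v}/{m}-{v}.tar.gz'.format(m=model_name, v=version))
```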

@@ -2,6 +2,7 @@
 from __future__ import unicode_literals, division, print_function

 import json
+from collections import defaultdict

 from ..util import ensure_path
 from ..scorer import Scorer

@@ -62,10 +63,10 @@ def train_model(Language, train_data, dev_data, output_path, tagger_cfg, parser_
     for itn, epoch in enumerate(trainer.epochs(n_iter, augment_data=None)):
         for doc, gold in epoch:
             trainer.update(doc, gold)
-        dev_scores = trainer.evaluate(dev_data) if dev_data else []
+        dev_scores = trainer.evaluate(dev_data).scores if dev_data else defaultdict(float)
         print_progress(itn, trainer.nlp.parser.model.nr_weight,
                        trainer.nlp.parser.model.nr_active_feat,
-                       **dev_scores.scores)
+                       **dev_scores)


 def evaluate(Language, gold_tuples, output_path):
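
The old fallback `[]` has no `.scores` attribute, so any run without dev data crashed before printing progress; `defaultdict(float)` is dict-like and reads as 0.0 for any missing metric. A two-line illustration of that property:

```python
from collections import defaultdict

dev_scores = defaultdict(float)                  # stand-in when dev_data is None
print(dev_scores['uas'], dev_scores['ents_f'])   # -> 0.0 0.0, no KeyError
```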

@@ -23,6 +23,8 @@ is_windows = sys.platform.startswith('win')
 is_linux = sys.platform.startswith('linux')
 is_osx = sys.platform == 'darwin'

+fix_text = ftfy.fix_text
+

 if is_python2:
     bytes_ = str

@@ -39,9 +41,6 @@ elif is_python3:
     json_dumps = lambda data: ujson.dumps(data, indent=2)


-fix_text = lambda text: ftfy.fix_text(text)
-
-
 def symlink_to(orig, dest):
     if is_python2 and is_windows:
         import subprocess

@@ -195,7 +195,7 @@ class Language(object):
         if directory.exists():
             shutil.rmtree(str(directory))
         directory.mkdir()
-        with (directory / 'config.json').open('wb') as file_:
+        with (directory / 'config.json').open('w') as file_:
             data = json_dumps(config)
             file_.write(data)
         if not (path / 'vocab').exists():

@@ -366,6 +366,8 @@ class Language(object):
         }

         path = util.ensure_path(path)
+        if not path.exists():
+            path.mkdir()
         self.setup_directory(path, **configs)

         strings_loc = path / 'vocab' / 'strings.json'
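
The `'wb'` → `'w'` switch matters on Python 3, where `json_dumps` (built on `ujson`, as in the compat changes above) returns `str`, not `bytes`. A minimal reproduction of the failure the new mode avoids:

```python
import ujson

data = ujson.dumps({'lang': 'en'}, indent=2)   # returns str on Python 3
with open('config.json', 'w') as f:            # 'wb' would raise here:
    f.write(data)                              # TypeError: a bytes-like object is required, not 'str'
```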

spacy/nb/__init__.py (new file, 25 lines)
@@ -0,0 +1,25 @@
# encoding: utf8
from __future__ import unicode_literals, print_function

from os import path

from ..language import Language
from ..attrs import LANG


# Import language-specific data
from .language_data import *


# create Language subclass
class Norwegian(Language):
    lang = 'nb'  # ISO code

    class Defaults(Language.Defaults):
        lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
        lex_attr_getters[LANG] = lambda text: 'nb'

        # override defaults
        tokenizer_exceptions = TOKENIZER_EXCEPTIONS
        #tag_map = TAG_MAP
        stop_words = STOP_WORDS
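
A quick way to exercise the new subclass without a statistical model is to build only its rule-based tokenizer, exactly as the `nb_tokenizer` test fixture further down does. A minimal sketch:

```python
from spacy.nb import Norwegian

# No trained Norwegian model ships with spaCy 1.8.x, so construct just the
# tokenizer from the language defaults:
tokenizer = Norwegian.Defaults.create_tokenizer()
doc = tokenizer(u'Smørsausen brukes bl.a. til fisk')
print([t.text for t in doc])  # 'bl.a.' survives as one token via the exceptions
```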

spacy/nb/language_data.py (new file, 28 lines)
@@ -0,0 +1,28 @@
# encoding: utf8
from __future__ import unicode_literals


# import base language data
from .. import language_data as base


# import util functions
from ..language_data import update_exc, strings_to_exc, expand_exc

# import language-specific data from files
#from .tag_map import TAG_MAP
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS, ORTH_ONLY
from .morph_rules import MORPH_RULES

TOKENIZER_EXCEPTIONS = dict(TOKENIZER_EXCEPTIONS)
#TAG_MAP = dict(TAG_MAP)
STOP_WORDS = set(STOP_WORDS)

# customize tokenizer exceptions
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(ORTH_ONLY))
update_exc(TOKENIZER_EXCEPTIONS, expand_exc(TOKENIZER_EXCEPTIONS, "'", "’"))
update_exc(TOKENIZER_EXCEPTIONS, strings_to_exc(base.EMOTICONS))

# export
__all__ = ["TOKENIZER_EXCEPTIONS", "STOP_WORDS", "MORPH_RULES"]
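
For orientation, the helpers imported from `..language_data` are small dict utilities. A rough, simplified stand-in for how `strings_to_exc` feeds `update_exc` here (a sketch of the observed usage, not the real implementations):

```python
# Simplified stand-ins for the helpers used above:
ORTH = 'orth'  # placeholder for the real spacy.symbols.ORTH constant

def strings_to_exc(strings):
    # "bl.a." -> {"bl.a.": [{ORTH: "bl.a."}]}: keep the string as one token
    return {s: [{ORTH: s}] for s in strings}

exceptions = {}
exceptions.update(strings_to_exc(["bl.a.", "f.eks."]))
print(exceptions["bl.a."])  # [{'orth': 'bl.a.'}]
```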

spacy/nb/morph_rules.py (new file, 67 lines)
@@ -0,0 +1,67 @@
# encoding: utf8
# norwegian bokmål
from __future__ import unicode_literals

from ..symbols import *
from ..language_data import PRON_LEMMA

# Used the table of pronouns at https://no.wiktionary.org/wiki/Tillegg:Pronomen_i_norsk

MORPH_RULES = {
    "PRP": {
        "jeg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Nom"},
        "meg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Case": "Acc"},
        "du": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Nom"},
        "deg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Case": "Acc"},
        "han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Nom"},
        "ham": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
        "han": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Masc", "Case": "Acc"},
        "hun": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Nom"},
        "henne": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Fem", "Case": "Acc"},
        "den": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
        "det": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Gender": "Neut"},
        "seg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Reflex": "Yes"},
        "vi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Nom"},
        "oss": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Case": "Acc"},
        "dere": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Case": "Nom"},
        "de": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Nom"},
        "dem": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Case": "Acc"},
        "seg": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Plur", "Reflex": "Yes"},

        "min": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
        "mi": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
        "mitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
        "mine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
        "din": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
        "di": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
        "ditt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
        "dine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Two", "Number": "Plur", "Poss": "Yes"},
        "hans": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Masc"},
        "hennes": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Fem"},
        "dens": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
        "dets": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "Three", "Number": "Sing", "Poss": "Yes", "Gender": "Neu"},
        "vår": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
        "vårt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes"},
        "våre": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Gender": "Neu"},
        "deres": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Gender": "Neu", "Reflex": "Yes"},
        "sin": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Masc", "Reflex": "Yes"},
        "si": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Fem", "Reflex": "Yes"},
        "sitt": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Sing", "Poss": "Yes", "Gender": "Neu", "Reflex": "Yes"},
        "sine": {LEMMA: PRON_LEMMA, "PronType": "Prs", "Person": "One", "Number": "Plur", "Poss": "Yes", "Reflex": "Yes"},
    },

    "VBZ": {
        "er": {LEMMA: "be", "VerbForm": "Fin", "Person": "One", "Tense": "Pres", "Mood": "Ind"},
        "er": {LEMMA: "be", "VerbForm": "Fin", "Person": "Two", "Tense": "Pres", "Mood": "Ind"},
        "er": {LEMMA: "be", "VerbForm": "Fin", "Person": "Three", "Tense": "Pres", "Mood": "Ind"},
    },

    "VBP": {
        "er": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Pres", "Mood": "Ind"}
    },

    "VBD": {
        "var": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Sing"},
        "vært": {LEMMA: "be", "VerbForm": "Fin", "Tense": "Past", "Number": "Plur"}
    }
}
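
One thing to watch in this table: a Python dict literal silently keeps only the last value for a repeated key, so duplicates such as the two "han" entries under "PRP" and the three "er" entries under "VBZ" collapse to their final occurrence. A two-line demonstration:

```python
# Duplicate keys are legal in a dict literal, but only the last one survives:
rules = {"er": {"Person": "One"}, "er": {"Person": "Two"}, "er": {"Person": "Three"}}
print(rules)  # {'er': {'Person': 'Three'}}
```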

spacy/nb/stop_words.py (new file, 49 lines)
@@ -0,0 +1,49 @@
# encoding: utf8
from __future__ import unicode_literals

STOP_WORDS = set("""
alle allerede alt and andre annen annet at av

bak bare bedre beste blant ble bli blir blitt bris by både

da dag de del dem den denne der dermed det dette disse drept du

eller en enn er et ett etter

fem fikk fire fjor flere folk for fortsatt fotball fra fram frankrike fredag funnet få får fått før først første

gang gi gikk gjennom gjorde gjort gjør gjøre god godt grunn gå går

ha hadde ham han hans har hele helt henne hennes her hun hva hvor hvordan hvorfor

i ifølge igjen ikke ingen inn

ja jeg

kamp kampen kan kl klart kom komme kommer kontakt kort kroner kunne kveld kvinner

la laget land landet langt leder ligger like litt løpet lørdag

man mandag mange mannen mars med meg mellom men mener menn mennesker mens mer millioner minutter mot msci mye må mål måtte

ned neste noe noen nok norge norsk norske ntb ny nye nå når

og også om onsdag opp opplyser oslo oss over

personer plass poeng politidistrikt politiet president prosent på

regjeringen runde rundt russland

sa saken samme sammen samtidig satt se seg seks selv senere september ser sett siden sier sin sine siste sitt skal skriver skulle slik som sted stedet stor store står sverige svært så søndag

ta tatt tid tidligere til tilbake tillegg tirsdag to tok torsdag tre tror tyskland

under usa ut uten utenfor

vant var ved veldig vi videre viktig vil ville viser vår være vært

å år

ønsker

""".split())

spacy/nb/tokenizer_exceptions.py (new file, 175 lines)
@@ -0,0 +1,175 @@
# encoding: utf8
# Norwegian bokmål
from __future__ import unicode_literals

from ..symbols import *
from ..language_data import PRON_LEMMA


TOKENIZER_EXCEPTIONS = {
    "jan.": [
        {ORTH: "jan.", LEMMA: "januar"}
    ],

    "feb.": [
        {ORTH: "feb.", LEMMA: "februar"}
    ],

    "jul.": [
        {ORTH: "jul.", LEMMA: "juli"}
    ]
}


ORTH_ONLY = ["adm.dir.",
             "a.m.",
             "Aq.",
             "b.c.",
             "bl.a.",
             "bla.",
             "bm.",
             "bto.",
             "ca.",
             "cand.mag.",
             "c.c.",
             "co.",
             "d.d.",
             "dept.",
             "d.m.",
             "dr.philos.",
             "dvs.",
             "d.y.",
             "E. coli",
             "eg.",
             "ekskl.",
             "e.Kr.",
             "el.",
             "e.l.",
             "et.",
             "etg.",
             "ev.",
             "evt.",
             "f.",
             "f.eks.",
             "fhv.",
             "fk.",
             "f.Kr.",
             "f.o.m.",
             "foreg.",
             "fork.",
             "fv.",
             "fvt.",
             "g.",
             "gt.",
             "gl.",
             "gno.",
             "gnr.",
             "grl.",
             "hhv.",
             "hoh.",
             "hr.",
             "h.r.adv.",
             "ifb.",
             "ifm.",
             "iht.",
             "inkl.",
             "istf.",
             "jf.",
             "jr.",
             "jun.",
             "kfr.",
             "kgl.res.",
             "kl.",
             "komm.",
             "kst.",
             "lø.",
             "ma.",
             "mag.art.",
             "m.a.o.",
             "md.",
             "mfl.",
             "mill.",
             "min.",
             "m.m.",
             "mnd.",
             "moh.",
             "Mr.",
             "muh.",
             "mv.",
             "mva.",
             "ndf.",
             "no.",
             "nov.",
             "nr.",
             "nto.",
             "nyno.",
             "n.å.",
             "o.a.",
             "off.",
             "ofl.",
             "okt.",
             "o.l.",
             "on.",
             "op.",
             "osv.",
             "ovf.",
             "p.",
             "p.a.",
             "Pb.",
             "pga.",
             "ph.d.",
             "pkt.",
             "p.m.",
             "pr.",
             "pst.",
             "p.t.",
             "red.anm.",
             "ref.",
             "res.",
             "res.kap.",
             "resp.",
             "rv.",
             "s.",
             "s.d.",
             "sen.",
             "sep.",
             "siviling.",
             "sms.",
             "spm.",
             "sr.",
             "sst.",
             "st.",
             "stip.",
             "stk.",
             "st.meld.",
             "st.prp.",
             "stud.",
             "s.u.",
             "sv.",
             "sø.",
             "s.å.",
             "såk.",
             "temp.",
             "ti.",
             "tils.",
             "tilsv.",
             "tl;dr",
             "tlf.",
             "to.",
             "t.o.m.",
             "ult.",
             "utg.",
             "v.",
             "vedk.",
             "vedr.",
             "vg.",
             "vgs.",
             "vha.",
             "vit.ass.",
             "vn.",
             "vol.",
             "vs.",
             "vsa.",
             "årg.",
             "årh."
             ]

@@ -10,15 +10,22 @@ def english_noun_chunks(obj):
     Works on both Doc and Span.
     """
     labels = ['nsubj', 'dobj', 'nsubjpass', 'pcomp', 'pobj',
-              'attr', 'ROOT', 'root']
+              'attr', 'ROOT']
     doc = obj.doc  # Ensure works on both Doc and Span.
     np_deps = [doc.vocab.strings[label] for label in labels]
     conj = doc.vocab.strings['conj']
     np_label = doc.vocab.strings['NP']
+    seen = set()
     for i, word in enumerate(obj):
         if word.pos not in (NOUN, PROPN, PRON):
             continue
+        # Prevent nested chunks from being produced
+        if word.i in seen:
+            continue
         if word.dep in np_deps:
+            if any(w.i in seen for w in word.subtree):
+                continue
+            seen.update(j for j in range(word.left_edge.i, word.i+1))
             yield word.left_edge.i, word.i+1, np_label
         elif word.dep == conj:
             head = word.head

@@ -26,6 +33,9 @@ def english_noun_chunks(obj):
             head = head.head
         # If the head is an NP, and we're coordinated to it, we're an NP
         if head.dep in np_deps:
+            if any(w.i in seen for w in word.subtree):
+                continue
+            seen.update(j for j in range(word.left_edge.i, word.i+1))
             yield word.left_edge.i, word.i+1, np_label
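
The new `seen` set is what `test_issue995` below exercises: once a chunk is yielded, any candidate overlapping its tokens is skipped, so nested noun chunks can't be produced. The guard's logic in isolation, on hypothetical `(left_edge, end)` token spans:

```python
seen = set()
candidate_chunks = [(0, 4), (2, 4), (5, 7)]   # (left_edge, end) token spans

for left, end in candidate_chunks:
    if any(i in seen for i in range(left, end)):
        continue                   # overlaps an already-yielded chunk: skip
    seen.update(range(left, end))
    print('chunk', (left, end))    # -> (0, 4) and (5, 7); nested (2, 4) dropped
```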

@@ -11,6 +11,8 @@ import ujson
 cimport cython
 cimport cython.parallel

+import numpy.random
+
 from cpython.ref cimport PyObject, Py_INCREF, Py_XDECREF
 from cpython.exc cimport PyErr_CheckSignals
 from libc.stdint cimport uint32_t, uint64_t

@@ -146,7 +148,7 @@ cdef class Parser:
         if 'labels' in cfg and 'actions' not in cfg:
             cfg['actions'] = cfg.pop('labels')
         # TODO: remove this shim when we don't have to support older data
-        for action_name, labels in dict(cfg['actions']).items():
+        for action_name, labels in dict(cfg.get('actions', {})).items():
             # We need this to be sorted
             if isinstance(labels, dict):
                 labels = list(sorted(labels.keys()))

@@ -303,7 +305,7 @@ cdef class Parser:
         free(eg.is_valid)
         return 0

-    def update(self, Doc tokens, GoldParse gold, itn=0):
+    def update(self, Doc tokens, GoldParse gold, itn=0, double drop=0.0):
         """
         Update the statistical model.

@@ -325,9 +327,11 @@ cdef class Parser:
                          nr_feat=self.model.nr_feat)
         cdef weight_t loss = 0
         cdef Transition action
+        cdef double dropout_rate = self.cfg.get('dropout', drop)
         while not stcls.is_final():
             eg.c.nr_feat = self.model.set_featuresC(eg.c.atoms, eg.c.features,
                                                     stcls.c)
+            dropout(eg.c.features, eg.c.nr_feat, dropout_rate)
             self.moves.set_costs(eg.c.is_valid, eg.c.costs, stcls, gold)
             self.model.set_scoresC(eg.c.scores, eg.c.features, eg.c.nr_feat)
             guess = VecVec.arg_max_if_true(eg.c.scores, eg.c.is_valid, eg.c.nr_class)

@@ -378,6 +382,18 @@ cdef class Parser:
         self.cfg.setdefault('extra_labels', []).append(label)


+cdef int dropout(FeatureC* feats, int nr_feat, float prob) except -1:
+    if prob <= 0 or prob >= 1.:
+        return 0
+    cdef double[::1] py_probs = numpy.random.uniform(0., 1., nr_feat)
+    cdef double* probs = &py_probs[0]
+    for i in range(nr_feat):
+        if probs[i] >= prob:
+            feats[i].value /= prob
+        else:
+            feats[i].value = 0.
+
+
 cdef class StepwiseState:
     cdef readonly StateClass stcls
     cdef readonly Example eg
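
For intuition, the new `dropout` helper zeroes a random subset of feature values during training and rescales the rest. A standalone NumPy sketch of the standard "inverted dropout" form for comparison — note it scales survivors by `1 / (1 - drop)`, whereas the Cython helper above divides by `prob`, a different scaling convention:

```python
import numpy as np

def feature_dropout(values, drop, rng=np.random):
    # Standard inverted dropout: zero each feature with probability `drop`,
    # scale survivors by 1/(1-drop) so the expected sum is unchanged.
    keep_mask = rng.uniform(0., 1., len(values)) >= drop
    return np.where(keep_mask, values / (1. - drop), 0.)

values = np.ones(8)
print(feature_dropout(values, drop=0.5))  # roughly half zeros, survivors == 2.0
```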

@@ -13,6 +13,8 @@ from ..hu import Hungarian
 from ..fi import Finnish
 from ..bn import Bengali
 from ..he import Hebrew
+from ..nb import Norwegian
+

 from ..tokens import Doc
 from ..strings import StringStore

@@ -26,7 +28,7 @@ import pytest


 LANGUAGES = [English, German, Spanish, Italian, French, Portuguese, Dutch,
-             Swedish, Hungarian, Finnish, Bengali]
+             Swedish, Hungarian, Finnish, Bengali, Norwegian]


 @pytest.fixture(params=LANGUAGES)

@@ -88,6 +90,9 @@ def bn_tokenizer():
 def he_tokenizer():
     return Hebrew.Defaults.create_tokenizer()

+@pytest.fixture
+def nb_tokenizer():
+    return Norwegian.Defaults.create_tokenizer()

 @pytest.fixture
 def stringstore():

spacy/tests/nb/__init__.py (new file, empty)

spacy/tests/nb/test_tokenizer.py (new file, 17 lines)
@@ -0,0 +1,17 @@
# coding: utf8
from __future__ import unicode_literals

import pytest


NB_TOKEN_EXCEPTION_TESTS = [
    ('Smørsausen brukes bl.a. til fisk', ['Smørsausen', 'brukes', 'bl.a.', 'til', 'fisk']),
    ('Jeg kommer først kl. 13 pga. diverse forsinkelser', ['Jeg', 'kommer', 'først', 'kl.', '13', 'pga.', 'diverse', 'forsinkelser'])
]


@pytest.mark.parametrize('text,expected_tokens', NB_TOKEN_EXCEPTION_TESTS)
def test_tokenizer_handles_exception_cases(nb_tokenizer, text, expected_tokens):
    tokens = nb_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list

@@ -1,3 +1,4 @@
+from __future__ import unicode_literals
 from ... import load as load_spacy
 from ...attrs import LEMMA
 from ...matcher import merge_phrase

@@ -70,7 +70,6 @@ def temp_save_model(model):


-@pytest.mark.xfail
 @pytest.mark.models
 def test_issue910(train_data, additional_entity_types):
     '''Test that adding entities and resuming training works passably OK.

@@ -85,11 +84,10 @@ def test_issue910(train_data, additional_entity_types):
     ents_before_train = [(ent.label_, ent.text) for ent in doc.ents]
     # Fine tune the ner model
     for entity_type in additional_entity_types:
-        if entity_type not in nlp.entity.cfg['actions']['1']:
-            nlp.entity.add_label(entity_type)
+        nlp.entity.add_label(entity_type)

-    nlp.entity.learn_rate = 0.001
-    for itn in range(4):
+    nlp.entity.model.learn_rate = 0.001
+    for itn in range(10):
         random.shuffle(train_data)
         for raw_text, entity_offsets in train_data:
             doc = nlp.make_doc(raw_text)

@@ -101,13 +99,12 @@ def test_issue910(train_data, additional_entity_types):
         # Load the fine tuned model
         loaded_ner = EntityRecognizer.load(model_dir, nlp.vocab)

-        for entity_type in additional_entity_types:
-            if entity_type not in loaded_ner.cfg['actions']['1']:
-                loaded_ner.add_label(entity_type)
-
-        doc = nlp(u"I am looking for a restaurant in Berlin", entity=False)
+        for raw_text, entity_offsets in train_data:
+            doc = nlp.make_doc(raw_text)
             nlp.tagger(doc)
             loaded_ner(doc)
-        ents_after_train = [(ent.label_, ent.text) for ent in doc.ents]
-        assert ents_before_train == ents_after_train
+            ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
+            for start, end, label in entity_offsets:
+                if (start, end) not in ents:
+                    print(ents)
+                assert ents[(start, end)] == label

spacy/tests/regression/test_issue995.py (new file, 22 lines)
@@ -0,0 +1,22 @@
from __future__ import unicode_literals

import pytest
from ... import load as load_spacy

@pytest.fixture
def doc():
    nlp = load_spacy('en')
    return nlp('Does flight number three fifty-four require a connecting flight'
               ' to get to Boston?')


@pytest.mark.models
def test_issue955(doc):
    '''Test that we don't have any nested noun chunks'''
    seen_tokens = set()
    for np in doc.noun_chunks:
        print(np.text, np.root.text, np.root.dep_, np.root.tag_)
        for word in np:
            key = (word.i, word.text)
            assert key not in seen_tokens
            seen_tokens.add(key)

spacy/tests/regression/test_issue999.py (new file, 78 lines)
@@ -0,0 +1,78 @@
from __future__ import unicode_literals
import os
import random
import contextlib
import shutil
import pytest
import tempfile
from pathlib import Path


import pathlib
from ...gold import GoldParse
from ...pipeline import EntityRecognizer
from ...language import Language

try:
    unicode
except NameError:
    unicode = str


@pytest.fixture
def train_data():
    return [
        ["hey", []],
        ["howdy", []],
        ["hey there", []],
        ["hello", []],
        ["hi", []],
        ["i'm looking for a place to eat", []],
        ["i'm looking for a place in the north of town", [[31, 36, "location"]]],
        ["show me chinese restaurants", [[8, 15, "cuisine"]]],
        ["show me chines restaurants", [[8, 14, "cuisine"]]],
    ]


@contextlib.contextmanager
def temp_save_model(model):
    model_dir = Path(tempfile.mkdtemp())
    model.save_to_directory(model_dir)
    yield model_dir
    shutil.rmtree(model_dir.as_posix())


def test_issue999(train_data):
    '''Test that adding entities and resuming training works passably OK.
    There are two issues here:

    1) We have to readd labels. This isn't very nice.
    2) There's no way to set the learning rate for the weight update, so we
       end up out-of-scale, causing it to learn too fast.
    '''
    nlp = Language(path=None, entity=False, tagger=False, parser=False)
    nlp.entity = EntityRecognizer(nlp.vocab, features=Language.Defaults.entity_features)
    for _, offsets in train_data:
        for start, end, ent_type in offsets:
            nlp.entity.add_label(ent_type)
    nlp.entity.model.learn_rate = 0.001
    for itn in range(100):
        random.shuffle(train_data)
        for raw_text, entity_offsets in train_data:
            doc = nlp.make_doc(raw_text)
            gold = GoldParse(doc, entities=entity_offsets)
            loss = nlp.entity.update(doc, gold)

    with temp_save_model(nlp) as model_dir:
        nlp2 = Language(path=model_dir)

    for raw_text, entity_offsets in train_data:
        doc = nlp2(raw_text)
        ents = {(ent.start_char, ent.end_char): ent.label_ for ent in doc.ents}
        for start, end, label in entity_offsets:
            if (start, end) in ents:
                assert ents[(start, end)] == label
                break
        else:
            if entity_offsets:
                raise Exception(ents)
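
The test's final check relies on Python's `for`/`else`: the `else` branch runs only when the loop finishes without `break`, i.e. when no predicted entity matched any gold offset. A two-second refresher:

```python
# for/else: the else branch fires only if the loop never hit `break`.
for item in [1, 2, 3]:
    if item == 99:
        break
else:
    print('no match found')   # reached, because we never broke out
```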

@@ -77,3 +77,15 @@ def test_spans_override_sentiment(en_tokenizer):
     assert doc[:2].sentiment == 10.0
     assert doc[-2:].sentiment == 10.0
     assert doc[:-1].sentiment == 10.0
+
+
+def test_spans_are_hashable(en_tokenizer):
+    """Test spans can be hashed."""
+    text = "good stuff bad stuff"
+    tokens = en_tokenizer(text)
+    span1 = tokens[:2]
+    span2 = tokens[2:4]
+    assert hash(span1) != hash(span2)
+    span3 = tokens[0:2]
+    assert hash(span3) == hash(span1)

@@ -2,17 +2,18 @@
 from __future__ import unicode_literals

 from ..cli.download import download, get_compatibility, get_version, check_error_depr
+
 import pytest


 @pytest.mark.parametrize('model', ['en_core_web_md'])
-def test_download_get_matching_version_succeeds(model):
+def test_cli_download_get_matching_version_succeeds(model):
     comp = { model: ['1.7.0', '0.100.0'] }
     assert get_version(model, comp)


 @pytest.mark.parametrize('model', ['en_core_web_md'])
-def test_download_get_matching_version_fails(model):
+def test_cli_download_get_matching_version_fails(model):
     diff_model = 'test_' + model
     comp = { diff_model: ['1.7.0', '0.100.0'] }
     with pytest.raises(SystemExit):

@@ -20,6 +21,6 @@ def test_download_get_matching_version_fails(model):


 @pytest.mark.parametrize('model', [False, None, '', 'all'])
-def test_download_no_model_depr_error(model):
+def test_cli_download_no_model_depr_error(model):
     with pytest.raises(SystemExit):
         check_error_depr(model)
13
spacy/tests/test_misc.py
Normal file
13
spacy/tests/test_misc.py
Normal file
|
@@ -0,0 +1,13 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+from ..util import ensure_path
+
+from pathlib import Path
+import pytest
+
+
+@pytest.mark.parametrize('text', ['hello/world', 'hello world'])
+def test_util_ensure_path_succeeds(text):
+    path = ensure_path(text)
+    assert isinstance(path, Path)
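For context, a plausible sketch of the helper under test — not necessarily the actual spacy.util implementation — coercing plain strings to pathlib.Path objects:

    from pathlib import Path

    def ensure_path(path):
        # Coerce strings to Path; pass Path objects (and anything else)
        # through unchanged, so callers can supply either.
        if isinstance(path, str):
            return Path(path)
        return path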
@@ -66,6 +66,10 @@ cdef class Span:
         elif op == 5:
             return self.start_char >= other.start_char

+    def __hash__(self):
+        return hash((self.doc, self.label, self.start_char, self.end_char))
+
     def __len__(self):
         self._recalculate_indices()
         if self.end < self.start:
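Making Span hashable means spans can now live in sets and serve as dict keys. A minimal sketch, assuming a loaded pipeline such as nlp = spacy.load('en'); because the hash covers the doc, label and character offsets, equal slices of the same doc hash identically:

    doc = nlp(u'good stuff bad stuff')
    seen = {doc[0:2], doc[2:4]}    # spans in a set
    assert doc[0:2] in seen        # an equal slice hashes the same
    scores = {doc[0:2]: 10.0}      # spans as dict keys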
@@ -55,7 +55,7 @@
         }
     },

-    "V_CSS": "1.4",
+    "V_CSS": "1.5",
     "V_JS": "1.2",
     "DEFAULT_SYNTAX": "python",
     "ANALYTICS": "UA-58931649-1",
@@ -198,8 +198,8 @@ mixin table(head)

 //- Table row (only used within +table)

-mixin row()
-    tr.c-table__row&attributes(attributes)
+mixin row(...style)
+    tr.c-table__row(class=prefixArgs(style, "c-table__row"))&attributes(attributes)
         block
@@ -283,3 +283,21 @@ mixin card-item(title, details)
         if details.author
             br
             span.u-text-small.u-color-subtle by #{details.author}
+
+
+//- Model row for models table
+
+mixin model-row(name, lang, procon, size, license, default_model, divider)
+    - var licenses = { "CC BY-SA": "https://creativecommons.org/licenses/by-sa/3.0/", "CC BY-NC": "https://creativecommons.org/licenses/by-nc/3.0/" }
+
+    +row(divider ? "divider": null)
+        +cell #[code=name]
+            if default_model
+                |  #[span.u-color-theme(title="default model") #[+icon("star", 16)]]
+        +cell=lang
+        each icon in procon
+            +cell.u-text-center #[+procon(icon ? "pro" : "con")]
+        +cell.u-text-right=size
+        +cell
+            if license in licenses
+                +a(licenses[license])=license
@@ -20,6 +20,9 @@
     @extend .u-text-label
     color: $color-theme

+    &.c-table__row--divider
+        border-top: 2px solid $color-theme
+

 //- Table cell
@@ -24,5 +24,8 @@
     <symbol id="chat" viewBox="0 0 24 24">
         <path d="M18 8.016v-2.016h-12v2.016h12zM18 11.016v-2.016h-12v2.016h12zM18 14.016v-2.016h-12v2.016h12zM21.984 3.984v18l-3.984-3.984h-14.016c-1.078 0-1.969-0.938-1.969-2.016v-12c0-1.078 0.891-1.969 1.969-1.969h16.031c1.078 0 1.969 0.891 1.969 1.969z"></path>
     </symbol>
+    <symbol id="star" viewBox="0 0 24 24">
+        <path d="M12 17.25l-6.188 3.75 1.641-7.031-5.438-4.734 7.172-0.609 2.813-6.609 2.813 6.609 7.172 0.609-5.438 4.734 1.641 7.031z"></path>
+    </symbol>
 </defs>
 </svg>
@@ -7,6 +7,7 @@ p spaCy currently supports the following languages and capabilities:
 +aside-code("Download language models", "bash").
     python -m spacy download en
     python -m spacy download de
+    python -m spacy download fr

 +table([ "Language", "Token", "SBD", "Lemma", "POS", "NER", "Dep", "Vector", "Sentiment"])
     +row
@@ -19,6 +20,14 @@ p spaCy currently supports the following languages and capabilities:
         each icon in [ "pro", "pro", "con", "pro", "pro", "pro", "pro", "con" ]
             +cell.u-text-center #[+procon(icon)]

+    +row
+        +cell French #[code fr]
+        each icon in [ "pro", "pro", "con", "pro", "con", "pro", "pro", "con" ]
+            +cell.u-text-center #[+procon(icon)]
+
++h(2, "available") Available models
+
+include ../usage/_models-list
+
 +h(2, "alpha-support") Alpha support
@@ -27,7 +36,7 @@ p
     | the existing language data and extending the tokenization patterns.

 +table([ "Language", "Source" ])
-    each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", fr: "French", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", hu: "Hungarian", bn: "Bengali", he: "Hebrew" }
+    each language, code in { zh: "Chinese", es: "Spanish", it: "Italian", pt: "Portuguese", nl: "Dutch", sv: "Swedish", fi: "Finnish", nb: "Norwegian Bokmål", hu: "Hungarian", bn: "Bengali", he: "Hebrew" }
         +row
             +cell #{language} #[code=code]
             +cell
27
website/docs/usage/_models-list.jade
Normal file
@@ -0,0 +1,27 @@
+//- 💫 DOCS > USAGE > MODELS LIST
+
+include ../../_includes/_mixins
+
+p
+    | Model differences are mostly statistical. In general, we do expect larger
+    | models to be "better" and more accurate overall. Ultimately, it depends on
+    | your use case and requirements, and we recommend starting with the default
+    | models (marked with a star below).
+
++aside
+    | Models are now available as #[code .tar.gz] archives #[+a(gh("spacy-models")) from GitHub],
+    | attached to individual releases. They can be downloaded and loaded manually,
+    | or using spaCy's #[code download] and #[code link] commands. All models
+    | follow the naming convention of #[code [language]_[type]_[genre]_[size]].
+    | #[br]#[br]
+
+    +button(gh("spacy-models"), true, "primary").u-text-tag
+        | View model releases
+
++table(["Name", "Language", "Voc", "Dep", "Ent", "Vec", "Size", "License"])
+    +model-row("en_core_web_sm", "English", [1, 1, 1, 1], "50 MB", "CC BY-SA", true)
+    +model-row("en_core_web_md", "English", [1, 1, 1, 1], "1 GB", "CC BY-SA")
+    +model-row("en_depent_web_md", "English", [1, 1, 1, 0], "328 MB", "CC BY-SA")
+    +model-row("en_vectors_glove_md", "English", [1, 0, 0, 1], "727 MB", "CC BY-SA")
+    +model-row("de_core_news_md", "German", [1, 1, 1, 1], "645 MB", "CC BY-SA", true, true)
+    +model-row("fr_depvec_web_lg", "French", [1, 1, 0, 1], "1.33 GB", "CC BY-NC", true, true)
@@ -27,9 +27,10 @@ p
     | #[a(href="#brown-clusters") Brown clusters] and
     | #[a(href="#word-vectors") word vectors].

+    +item
+        | #[strong Set up] a #[a(href="#model-directory") model directory] and #[strong train] the #[a(href="#train-tagger-parser") tagger and parser].

 p
-    | Once you have the tokenizer and vocabulary, you can
-    | #[+a("/docs/usage/training") train the tagger, parser and entity recognizer].
     | For some languages, you may also want to develop a solution for
     | lemmatization and morphological analysis.
@@ -98,6 +99,17 @@ p
     | so that Python functions can be used to help you generalise and combine
     | the data as you require.

++infobox("For languages with non-Latin characters")
+    | In order for the tokenizer to split suffixes, prefixes and infixes, spaCy
+    | needs to know the language's character set. If the language you're adding
+    | uses non-Latin characters, you might need to add the required character
+    | classes to the global
+    | #[+src(gh("spacy", "spacy/language_data/punctuation.py")) punctuation.py].
+    | spaCy uses the #[+a("https://pypi.python.org/pypi/regex/") #[code regex] library]
+    | to keep this simple and readable. If the language requires very specific
+    | punctuation rules, you should consider overwriting the default regular
+    | expressions with your own in the language's #[code Defaults].
+

 +h(3, "stop-words") Stop words

 p
@@ -395,12 +407,111 @@ p
     | by linear models, while the word vectors are useful for lexical
     | similarity models and deep learning.

++h(3, "word-frequencies") Word frequencies
+
+p
+    | To generate the word frequencies from a large, raw corpus, you can use the
+    | #[+src(gh("spacy-dev-resources", "training/word_freqs.py")) word_freqs.py]
+    | script from the spaCy developer resources. Note that your corpus should
+    | not be preprocessed (i.e. you need punctuation, for example). The
+    | #[+a("/docs/usage/cli#model") #[code model] command] expects a
+    | tab-separated word frequencies file with three columns:
+
++list("numbers")
+    +item The number of times the word occurred in your language sample.
+    +item The number of distinct documents the word occurred in.
+    +item The word itself.
+
+p
+    | An example word frequencies file could look like this:
+
++code("es_word_freqs.txt", "text").
+    6361109 111 Aunque
+    23598543 111 aunque
+    10097056 111 claro
+    193454 111 aro
+    7711123 111 viene
+    12812323 111 mal
+    23414636 111 momento
+    2014580 111 felicidad
+    233865 111 repleto
+    15527 111 eto
+    235565 111 deliciosos
+    17259079 111 buena
+    71155 111 Anímate
+    37705 111 anímate
+    33155 111 cuéntanos
+    2389171 111 cuál
+    961576 111 típico
+
+p
+    | You should make sure you use the spaCy tokenizer for your
+    | language to segment the text for your word frequencies. This will ensure
+    | that the frequencies refer to the same segmentation standards you'll be
+    | using at run-time. For instance, spaCy's English tokenizer segments
+    | "can't" into two tokens. If we segmented the text by whitespace to
+    | produce the frequency counts, we'll have incorrect frequency counts for
+    | the tokens "ca" and "n't".
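A minimal sketch of producing such a file with spaCy's tokenizer — not the actual word_freqs.py script — reading one raw, unpreprocessed document per line from a hypothetical corpus.txt:

    from __future__ import unicode_literals
    from collections import Counter
    import io
    import spacy.util

    nlp = spacy.util.get_lang_class('es')()         # tokenizer for your language

    word_counts = Counter()
    doc_counts = Counter()
    # Assumed input: one raw document per line (hypothetical path)
    with io.open('corpus.txt', encoding='utf8') as corpus:
        for text in corpus:
            words = [w.text for w in nlp.make_doc(text)]
            word_counts.update(words)
            doc_counts.update(set(words))           # each word counted once per doc

    with io.open('es_word_freqs.txt', 'w', encoding='utf8') as f:
        for word, freq in word_counts.items():
            # columns: frequency, document frequency, word
            f.write('%d\t%d\t%s\n' % (freq, doc_counts[word], word))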
+h(3, "brown-clusters") Training the Brown clusters
|
||||||
|
|
||||||
|
p
|
||||||
|
| spaCy's tagger, parser and entity recognizer are designed to use
|
||||||
|
| distributional similarity features provided by the
|
||||||
|
| #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
|
||||||
|
| You should train a model with between 500 and 1000 clusters. A minimum
|
||||||
|
| frequency threshold of 10 usually works well.
|
||||||
|
|
||||||
|
p
|
||||||
|
| An example clusters file could look like this:
|
||||||
|
|
||||||
|
+code("es_clusters.data", "text").
|
||||||
|
0000 Vestigial 1
|
||||||
|
0000 Vesturland 1
|
||||||
|
0000 Veyreau 1
|
||||||
|
0000 Veynes 1
|
||||||
|
0000 Vexilografía 1
|
||||||
|
0000 Vetrigne 1
|
||||||
|
0000 Vetónica 1
|
||||||
|
0000 Asunden 1
|
||||||
|
0000 Villalambrús 1
|
||||||
|
0000 Vichuquén 1
|
||||||
|
0000 Vichtis 1
|
||||||
|
0000 Vichigasta 1
|
||||||
|
0000 VAAH 1
|
||||||
|
0000 Viciebsk 1
|
||||||
|
0000 Vicovaro 1
|
||||||
|
0000 Villardeveyo 1
|
||||||
|
0000 Vidala 1
|
||||||
|
0000 Videoguard 1
|
||||||
|
0000 Vedás 1
|
||||||
|
0000 Videocomunicado 1
|
||||||
|
0000 VideoCrypt 1
|
||||||
|
|
||||||
|
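A minimal sketch of reading a clusters file in this format (cluster bit-string, word, frequency) back into a dict, e.g. to sanity-check the output:

    from __future__ import unicode_literals
    import io

    def read_clusters(path):
        # Map each word to its Brown cluster bit-string path.
        clusters = {}
        with io.open(path, encoding='utf8') as f:
            for line in f:
                cluster, word, freq = line.split()
                clusters[word] = cluster
        return clusters

    clusters = read_clusters('es_clusters.data')
    assert clusters['Vestigial'] == '0000'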
+h(3, "word-vectors") Training the word vectors
|
||||||
|
|
||||||
|
p
|
||||||
|
| #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
|
||||||
|
| algorithms let you train useful word similarity models from unlabelled
|
||||||
|
| text. This is a key part of using
|
||||||
|
| #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
|
||||||
|
| labelled data. The vectors are also useful by themselves – they power
|
||||||
|
| the #[code .similarity()] methods in spaCy. For best results, you should
|
||||||
|
| pre-process the text with spaCy before training the Word2vec model. This
|
||||||
|
| ensures your tokenization will match.
|
||||||
|
|
||||||
|
p
|
||||||
|
| You can use our
|
||||||
|
| #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
|
||||||
|
| which pre-processes the text with your language-specific tokenizer and
|
||||||
|
| trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
|
||||||
|
| The #[code vectors.bin] file should consist of one word and vector per line.
|
||||||
|
|
||||||
|
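A minimal sketch of that pipeline — tokenize with spaCy, then train with Gensim. This is not the actual word_vectors.py script; the hyperparameters are illustrative, the corpus path is hypothetical, and the Word2Vec keyword names vary across Gensim versions:

    from __future__ import unicode_literals
    from gensim.models import Word2Vec
    import io
    import spacy.util

    nlp = spacy.util.get_lang_class('es')()   # your language's tokenizer

    # Pre-tokenize with spaCy so the vectors match spaCy's segmentation at
    # run-time. Assumed input: one raw document per line (hypothetical path).
    with io.open('corpus.txt', encoding='utf8') as corpus:
        sentences = [[w.text for w in nlp.make_doc(line)] for line in corpus]

    model = Word2Vec(sentences, size=300, window=5, min_count=10, workers=4)
    model.wv.save_word2vec_format('vectors.bin', binary=True)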
+h(2, "model-directory") Setting up a model directory
|
||||||
|
|
||||||
p
|
p
|
||||||
| Once you've collected the word frequencies, Brown clusters and word
|
| Once you've collected the word frequencies, Brown clusters and word
|
||||||
| vectors files, you can use the
|
| vectors files, you can use the
|
||||||
| #[+src(gh("spacy-dev-resources", "training/init.py")) init.py]
|
|
||||||
| script from our
|
|
||||||
| #[+a(gh("spacy-dev-resources")) developer resources], or use the new
|
|
||||||
| #[+a("/docs/usage/cli#model") #[code model] command] to create a data
|
| #[+a("/docs/usage/cli#model") #[code model] command] to create a data
|
||||||
| directory:
|
| directory:
|
||||||
|
|
||||||
|
@@ -427,49 +538,22 @@ p
     | loaded. By default, the command expects to be able to find your language
     | class using #[code spacy.util.get_lang_class(lang_id)].

-+h(3, "word-frequencies") Word frequencies
++h(2, "train-tagger-parser") Training the tagger and parser

 p
-    | The #[+a("/docs/usage/cli#model") #[code model] command] expects a
-    | tab-separated word frequencies file with three columns:
-
-+list("numbers")
-    +item The number of times the word occurred in your language sample.
-    +item The number of distinct documents the word occurred in.
-    +item The word itself.
+    | You can now train the model using a corpus for your language annotated
+    | with #[+a("http://universaldependencies.org/") Universal Dependencies].
+    | If your corpus uses the
+    | #[+a("http://universaldependencies.org/docs/format.html") CoNLL-U] format,
+    | i.e. files with the extension #[code .conllu], you can use the
+    | #[+a("/docs/usage/cli#convert") #[code convert] command] to convert it to
+    | spaCy's #[+a("/docs/api/annotation#json-input") JSON format] for training.

 p
-    | You should make sure you use the spaCy tokenizer for your
-    | language to segment the text for your word frequencies. This will ensure
-    | that the frequencies refer to the same segmentation standards you'll be
-    | using at run-time. For instance, spaCy's English tokenizer segments
-    | "can't" into two tokens. If we segmented the text by whitespace to
-    | produce the frequency counts, we'll have incorrect frequency counts for
-    | the tokens "ca" and "n't".
-
-+h(3, "brown-clusters") Training the Brown clusters
-
-p
-    | spaCy's tagger, parser and entity recognizer are designed to use
-    | distributional similarity features provided by the
-    | #[+a("https://github.com/percyliang/brown-cluster") Brown clustering algorithm].
-    | You should train a model with between 500 and 1000 clusters. A minimum
-    | frequency threshold of 10 usually works well.
-
-+h(3, "word-vectors") Training the word vectors
-
-p
-    | #[+a("https://en.wikipedia.org/wiki/Word2vec") Word2vec] and related
-    | algorithms let you train useful word similarity models from unlabelled
-    | text. This is a key part of using
-    | #[+a("/docs/usage/deep-learning") deep learning] for NLP with limited
-    | labelled data. The vectors are also useful by themselves – they power
-    | the #[code .similarity()] methods in spaCy. For best results, you should
-    | pre-process the text with spaCy before training the Word2vec model. This
-    | ensures your tokenization will match.
-
-p
-    | You can use our
-    | #[+src(gh("spacy-dev-resources", "training/word_vectors.py")) word vectors training script],
-    | which pre-processes the text with your language-specific tokenizer and
-    | trains the model using #[+a("https://radimrehurek.com/gensim/") Gensim].
+    | Once you have your UD corpus transformed into JSON, you can train your
+    | model using spaCy's
+    | #[+a("/docs/usage/cli#train") #[code train] command]:
+
++code(false, "bash").
+    python -m spacy train [lang] [output_dir] [train_data] [dev_data] [--n_iter] [--parser_L1] [--no_tagger] [--no_parser] [--no_ner]
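Once training finishes, the output directory can be loaded like any other model directory. A minimal sketch mirroring the Language(path=...) pattern from the test at the top of this diff; the path is hypothetical, and in practice you would typically load a packaged model via its language-specific class or a linked name:

    from spacy.language import Language

    nlp = Language(path='/tmp/es_model')   # hypothetical output_dir from `spacy train`
    doc = nlp(u'El gato se sentó en la alfombra.')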
@@ -4,9 +4,9 @@ include ../../_includes/_mixins

 p
     | In this example, we'll be using #[+a("https://keras.io/") Keras], as
-    | it's the most popular deep learning library for Python. Let's assume
-    | you've written a custom sentiment analysis model that predicts whether a
-    | document is positive or negative. Now you want to find which entities
+    | it's the most popular deep learning library for Python. Using Keras,
+    | we will write a custom sentiment analysis model that predicts whether a
+    | document is positive or negative. Then, we will use it to find which entities
     | are commonly associated with positive or negative documents. Here's a
     | quick example of how that can look at runtime.
@@ -13,14 +13,6 @@ p
     | internal alias that tells spaCy where to find the data files for a specific
     | model name.

-+infobox("Important note")
-    | Due to improvements in the English lemmatizer in v1.7.0, you need to
-    | #[strong download the new English models]. The German model is still
-    | compatible. If you've trained statistical models that use spaCy's
-    | annotations, you should #[strong retrain your models after updating spaCy].
-    | If you don't retrain your models, you may suffer train/test skew, which
-    | might decrease your accuracy.
-
 +aside-code("Quickstart").
     # Install spaCy and download English model
     pip install spacy
@@ -31,43 +23,17 @@ p
     nlp = spacy.load('en')
     doc = nlp(u'This is a sentence.')

++infobox("Important note")
+    | Due to improvements in the English lemmatizer in v1.7.0, you need to
+    | #[strong download the new English models]. The German model is still
+    | compatible. If you've trained statistical models that use spaCy's
+    | annotations, you should #[strong retrain your models after updating spaCy].
+    | If you don't retrain your models, you may suffer train/test skew, which
+    | might decrease your accuracy.
+
 +h(2, "available") Available models

-+table(["Name", "Size", "Description"])
-    +row
-        +cell #[code en_core_web_sm]
-        +cell 50 MB
-        +cell Vocab, syntax, entities, word vectors #[+tag default]
-
-    +row
-        +cell #[code en_core_web_md]
-        +cell 1 GB
-        +cell Vocab, syntax, entities, word vectors
-
-    +row
-        +cell #[code en_depent_web_md]
-        +cell 328 MB
-        +cell Vocab, syntax, entities
-
-    +row
-        +cell #[code en_vectors_glove_md]
-        +cell 727 MB
-        +cell
-            | #[+a("http://nlp.stanford.edu/projects/glove/") GloVe] Common
-            | Crawl vectors
-
-    +row
-        +cell #[code de_core_news_md]
-        +cell 645 MB
-        +cell Vocab, syntax, entities, word vectors #[+tag default]
-
-p
-    | Models are now available as #[code .tar.gz] archives #[+a(gh("spacy-models")) from GitHub],
-    | attached to individual releases. They can be downloaded and loaded manually,
-    | or using spaCy's #[code download] and #[code link] commands. All models
-    | follow the naming convention of #[code [language]_[type]_[genre]_[size]].
-
-+button(gh("spacy-models") + "/releases", true, "primary") View models
+include _models-list

 +h(2, "download") Downloading models
@@ -95,6 +61,7 @@ p
     # out-of-the-box: download best-matching default model
     python -m spacy download en
     python -m spacy download de
+    python -m spacy download fr

     # download best-matching version of specific model for your spaCy installation
     python -m spacy download en_core_web_md
@@ -61,7 +61,7 @@ p
     | #[+a(gh("spacy-dev-resources", "templates/model")) spaCy dev resources].
     | If you're creating the package manually, keep in mind that the directories
     | need to be named according to the naming conventions of
-    | #[code [language]_[type]] and #[code [language]_[type]-[version]]. The
+    | #[code [language]_[name]] and #[code [language]_[name]-[version]]. The
     | #[code lang] setting in the meta.json is also used to create the
     | respective #[code Language] class in spaCy, which will later be returned
     | by the model's #[code load()] method.