Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-29 18:54:07 +03:00)

Commit 353f8486f5: Merge branch 'master' into spacy.io
106
.github/contributors/dhpollack.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [X] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | David Pollack |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | Mar 5. 2020 |
|
||||
| GitHub username | dhpollack |
|
||||
| Website (optional) | |
|
89
.github/contributors/mabraham.md
vendored
Normal file
|
@ -0,0 +1,89 @@
|
|||
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | |
|
||||
| GitHub username | |
|
||||
| Website (optional) | |
|
7
.gitignore
vendored
|
@ -5,6 +5,11 @@ corpora/
|
|||
keys/
|
||||
*.json.gz
|
||||
|
||||
# Tests
|
||||
spacy/tests/package/setup.cfg
|
||||
spacy/tests/package/pyproject.toml
|
||||
spacy/tests/package/requirements.txt
|
||||
|
||||
# Website
|
||||
website/.cache/
|
||||
website/public/
|
||||
|
@ -40,6 +45,7 @@ __pycache__/
|
|||
.~env/
|
||||
.venv
|
||||
venv/
|
||||
env3.*/
|
||||
.dev
|
||||
.denv
|
||||
.pypyenv
|
||||
|
@ -56,6 +62,7 @@ lib64/
|
|||
parts/
|
||||
sdist/
|
||||
var/
|
||||
wheelhouse/
|
||||
*.egg-info/
|
||||
pip-wheel-metadata/
|
||||
Pipfile.lock
|
||||
|
|
47
Makefile
|
@ -1,28 +1,37 @@
|
|||
SHELL := /bin/bash
|
||||
sha = $(shell "git" "rev-parse" "--short" "HEAD")
|
||||
version = $(shell "bin/get-version.sh")
|
||||
wheel = spacy-$(version)-cp36-cp36m-linux_x86_64.whl
|
||||
PYVER := 3.6
|
||||
VENV := ./env$(PYVER)
|
||||
|
||||
dist/spacy.pex : dist/spacy-$(sha).pex
|
||||
cp dist/spacy-$(sha).pex dist/spacy.pex
|
||||
chmod a+rx dist/spacy.pex
|
||||
version := $(shell "bin/get-version.sh")
|
||||
|
||||
dist/spacy-$(sha).pex : dist/$(wheel)
|
||||
env3.6/bin/python -m pip install pex==1.5.3
|
||||
env3.6/bin/pex pytest dist/$(wheel) spacy_lookups_data -e spacy -o dist/spacy-$(sha).pex
|
||||
dist/spacy-$(version).pex : wheelhouse/spacy-$(version).stamp
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m spacy -o $@ spacy==$(version) jsonschema spacy_lookups_data
|
||||
chmod a+rx $@
|
||||
|
||||
dist/$(wheel) : setup.py spacy/*.py* spacy/*/*.py*
|
||||
python3.6 -m venv env3.6
|
||||
source env3.6/bin/activate
|
||||
env3.6/bin/pip install wheel
|
||||
env3.6/bin/pip install -r requirements.txt --no-cache-dir
|
||||
env3.6/bin/python setup.py build_ext --inplace
|
||||
env3.6/bin/python setup.py sdist
|
||||
env3.6/bin/python setup.py bdist_wheel
|
||||
dist/pytest.pex : wheelhouse/pytest-*.whl
|
||||
$(VENV)/bin/pex -f ./wheelhouse --no-index --disable-cache -m pytest -o $@ pytest pytest-timeout mock
|
||||
chmod a+rx $@
|
||||
|
||||
.PHONY : clean
|
||||
wheelhouse/spacy-$(version).stamp : $(VENV)/bin/pex setup.py spacy/*.py* spacy/*/*.py*
|
||||
$(VENV)/bin/pip wheel . -w ./wheelhouse
|
||||
$(VENV)/bin/pip wheel jsonschema spacy_lookups_data -w ./wheelhouse
|
||||
touch $@
|
||||
|
||||
wheelhouse/pytest-%.whl : $(VENV)/bin/pex
|
||||
$(VENV)/bin/pip wheel pytest pytest-timeout mock -w ./wheelhouse
|
||||
|
||||
$(VENV)/bin/pex :
|
||||
python$(PYVER) -m venv $(VENV)
|
||||
$(VENV)/bin/pip install -U pip setuptools pex wheel
|
||||
|
||||
.PHONY : clean test
|
||||
|
||||
test : dist/spacy-$(version).pex dist/pytest.pex
|
||||
( . $(VENV)/bin/activate ; \
|
||||
PEX_PATH=dist/spacy-$(version).pex ./dist/pytest.pex --pyargs spacy -x ; )
|
||||
|
||||
clean : setup.py
|
||||
source env3.6/bin/activate
|
||||
rm -rf dist/*
|
||||
rm -rf ./wheelhouse
|
||||
rm -rf $(VENV)
|
||||
python setup.py clean --all
|
||||
|
|
|
@ -2,7 +2,7 @@
|
|||
|
||||
### Step 1: Create a Knowledge Base (KB) and training data
|
||||
|
||||
Run `wikipedia_pretrain_kb.py`
|
||||
Run `wikidata_pretrain_kb.py`
|
||||
* This takes as input the locations of a **Wikipedia and a Wikidata dump**, and produces a **KB directory** + **training file**
|
||||
* WikiData: get `latest-all.json.bz2` from https://dumps.wikimedia.org/wikidatawiki/entities/
|
||||
* Wikipedia: get `enwiki-latest-pages-articles-multistream.xml.bz2` from https://dumps.wikimedia.org/enwiki/latest/ (or for any other language)
|
||||
|
|
|
@ -1,3 +1,11 @@
|
|||
[build-system]
|
||||
requires = ["setuptools"]
|
||||
requires = [
|
||||
"setuptools",
|
||||
"wheel",
|
||||
"cython>=0.25",
|
||||
"cymem>=2.0.2,<2.1.0",
|
||||
"preshed>=3.0.2,<3.1.0",
|
||||
"murmurhash>=0.28.0,<1.1.0",
|
||||
"thinc==7.4.0",
|
||||
]
|
||||
build-backend = "setuptools.build_meta"
|
||||
|
|
|
@ -1,11 +1,11 @@
|
|||
# Our libraries
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
blis>=0.4.0,<0.5.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=1.0.2,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third party dependencies
|
||||
numpy>=1.15.0
|
||||
|
|
|
@ -38,16 +38,16 @@ setup_requires =
|
|||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
install_requires =
|
||||
# Our libraries
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
thinc==7.4.0.dev0
|
||||
thinc==7.4.0
|
||||
blis>=0.4.0,<0.5.0
|
||||
wasabi>=0.4.0,<1.1.0
|
||||
srsly>=1.0.1,<1.1.0
|
||||
srsly>=1.0.2,<1.1.0
|
||||
catalogue>=0.0.7,<1.1.0
|
||||
# Third-party dependencies
|
||||
tqdm>=4.38.0,<5.0.0
|
||||
|
@ -59,7 +59,7 @@ install_requires =
|
|||
|
||||
[options.extras_require]
|
||||
lookups =
|
||||
spacy_lookups_data>=0.0.5<0.2.0
|
||||
spacy_lookups_data>=0.0.5,<0.2.0
|
||||
cuda =
|
||||
cupy>=5.0.0b4
|
||||
cuda80 =
|
||||
|
|
|
@ -296,8 +296,7 @@ def link_vectors_to_models(vocab):
|
|||
key = (ops.device, vectors.name)
|
||||
if key in thinc.extra.load_nlp.VECTORS:
|
||||
if thinc.extra.load_nlp.VECTORS[key].shape != data.shape:
|
||||
# This is a hack to avoid the problem in #3853. Maybe we should
|
||||
# print a warning as well?
|
||||
# This is a hack to avoid the problem in #3853.
|
||||
old_name = vectors.name
|
||||
new_name = vectors.name + "_%d" % data.shape[0]
|
||||
user_warning(Warnings.W019.format(old=old_name, new=new_name))
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
__version__ = "2.2.3"
|
||||
__version__ = "2.2.4.dev0"
|
||||
__release__ = True
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
|
|
|
@ -23,24 +23,23 @@ BLANK_MODEL_THRESHOLD = 2000
|
|||
|
||||
|
||||
@plac.annotations(
|
||||
# fmt: off
|
||||
lang=("model language", "positional", None, str),
|
||||
train_path=("location of JSON-formatted training data", "positional", None, Path),
|
||||
dev_path=("location of JSON-formatted development data", "positional", None, Path),
|
||||
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
|
||||
base_model=("name of model to update (optional)", "option", "b", str),
|
||||
pipeline=(
|
||||
"Comma-separated names of pipeline components to train",
|
||||
"option",
|
||||
"p",
|
||||
str,
|
||||
),
|
||||
pipeline=("Comma-separated names of pipeline components to train", "option", "p", str),
|
||||
ignore_warnings=("Ignore warnings, only show stats and errors", "flag", "IW", bool),
|
||||
verbose=("Print additional information and explanations", "flag", "V", bool),
|
||||
no_format=("Don't pretty-print the results", "flag", "NF", bool),
|
||||
# fmt: on
|
||||
)
|
||||
def debug_data(
|
||||
lang,
|
||||
train_path,
|
||||
dev_path,
|
||||
tag_map_path=None,
|
||||
base_model=None,
|
||||
pipeline="tagger,parser,ner",
|
||||
ignore_warnings=False,
|
||||
|
@ -60,6 +59,10 @@ def debug_data(
|
|||
if not dev_path.exists():
|
||||
msg.fail("Development data not found", dev_path, exits=1)
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
|
||||
# Initialize the model and pipeline
|
||||
pipeline = [p.strip() for p in pipeline.split(",")]
|
||||
if base_model:
|
||||
|
@ -67,6 +70,8 @@ def debug_data(
|
|||
else:
|
||||
lang_cls = get_lang_class(lang)
|
||||
nlp = lang_cls()
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
msg.divider("Data format validation")
|
||||
|
||||
|
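The new `tag_map_path` option shared by `debug_data` and `train` reads a JSON file with srsly and merges it into the language defaults via `nlp.vocab.morphology.tag_map.update(...)`, so corpus-specific tags can be mapped without patching the language class. A minimal sketch of the idea; the file name and the example entry are hypothetical, not part of this commit:

import srsly
import spacy

# Hypothetical file: a JSON object mapping corpus tag strings to attribute
# dicts. The entry shown here in the comment, {"NNML": {"pos": "NOUN"}}, is
# only an illustration of the shape, not a value used anywhere in spaCy.
tag_map = srsly.read_json("tag_map.json")

nlp = spacy.blank("en")
# Mirrors what debug_data/train now do after building the pipeline:
nlp.vocab.morphology.tag_map.update(tag_map)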
@ -227,13 +232,17 @@ def debug_data(
|
|||
|
||||
if gold_train_data["ws_ents"]:
|
||||
msg.fail(
|
||||
"{} invalid whitespace entity span(s)".format(gold_train_data["ws_ents"])
|
||||
"{} invalid whitespace entity span(s)".format(
|
||||
gold_train_data["ws_ents"]
|
||||
)
|
||||
)
|
||||
has_ws_ents_error = True
|
||||
|
||||
if gold_train_data["punct_ents"]:
|
||||
msg.warn(
|
||||
"{} entity span(s) with punctuation".format(gold_train_data["punct_ents"])
|
||||
"{} entity span(s) with punctuation".format(
|
||||
gold_train_data["punct_ents"]
|
||||
)
|
||||
)
|
||||
has_punct_ents_warning = True
|
||||
|
||||
|
@ -344,7 +353,7 @@ def debug_data(
|
|||
if "tagger" in pipeline:
|
||||
msg.divider("Part-of-speech Tagging")
|
||||
labels = [label for label in gold_train_data["tags"]]
|
||||
tag_map = nlp.Defaults.tag_map
|
||||
tag_map = nlp.vocab.morphology.tag_map
|
||||
msg.info(
|
||||
"{} {} in data ({} {} in tag map)".format(
|
||||
len(labels),
|
||||
|
@ -584,7 +593,13 @@ def _compile_gold(train_docs, pipeline):
|
|||
if label.startswith(("B-", "U-", "L-")) and doc[i].is_space:
|
||||
# "Illegal" whitespace entity
|
||||
data["ws_ents"] += 1
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [".", "'", "!", "?", ","]:
|
||||
if label.startswith(("B-", "U-", "L-")) and doc[i].text in [
|
||||
".",
|
||||
"'",
|
||||
"!",
|
||||
"?",
|
||||
",",
|
||||
]:
|
||||
# punctuation entity: could be replaced by whitespace when training with noise,
|
||||
# so add a warning to alert the user to this unexpected side effect.
|
||||
data["punct_ents"] += 1
|
||||
|
|
|
@ -57,6 +57,7 @@ from .. import about
|
|||
textcat_multilabel=("Textcat classes aren't mutually exclusive (multilabel)", "flag", "TML", bool),
|
||||
textcat_arch=("Textcat model architecture", "option", "ta", str),
|
||||
textcat_positive_label=("Textcat positive label for binary classes with two labels", "option", "tpl", str),
|
||||
tag_map_path=("Location of JSON-formatted tag map", "option", "tm", Path),
|
||||
verbose=("Display more information for debug", "flag", "VV", bool),
|
||||
debug=("Run data diagnostics before training", "flag", "D", bool),
|
||||
# fmt: on
|
||||
|
@ -95,6 +96,7 @@ def train(
|
|||
textcat_multilabel=False,
|
||||
textcat_arch="bow",
|
||||
textcat_positive_label=None,
|
||||
tag_map_path=None,
|
||||
verbose=False,
|
||||
debug=False,
|
||||
):
|
||||
|
@ -132,6 +134,9 @@ def train(
|
|||
output_path.mkdir()
|
||||
msg.good("Created output directory: {}".format(output_path))
|
||||
|
||||
tag_map = {}
|
||||
if tag_map_path is not None:
|
||||
tag_map = srsly.read_json(tag_map_path)
|
||||
# Take dropout and batch size as generators of values -- dropout
|
||||
# starts high and decays sharply, to force the optimizer to explore.
|
||||
# Batch size starts at 1 and grows, so that we make updates quickly
|
||||
|
@ -238,6 +243,9 @@ def train(
|
|||
pipe_cfg = {}
|
||||
nlp.add_pipe(nlp.create_pipe(pipe, config=pipe_cfg))
|
||||
|
||||
# Update tag map with provided mapping
|
||||
nlp.vocab.morphology.tag_map.update(tag_map)
|
||||
|
||||
if vectors:
|
||||
msg.text("Loading vector from model '{}'".format(vectors))
|
||||
_load_vectors(nlp, vectors)
|
||||
|
@ -546,7 +554,30 @@ def train(
|
|||
with nlp.use_params(optimizer.averages):
|
||||
final_model_path = output_path / "model-final"
|
||||
nlp.to_disk(final_model_path)
|
||||
final_meta = srsly.read_json(output_path / "model-final" / "meta.json")
|
||||
meta_loc = output_path / "model-final" / "meta.json"
|
||||
final_meta = srsly.read_json(meta_loc)
|
||||
final_meta.setdefault("accuracy", {})
|
||||
final_meta["accuracy"].update(meta.get("accuracy", {}))
|
||||
final_meta.setdefault("speed", {})
|
||||
final_meta["speed"].setdefault("cpu", None)
|
||||
final_meta["speed"].setdefault("gpu", None)
|
||||
# combine cpu and gpu speeds with the base model speeds
|
||||
if final_meta["speed"]["cpu"] and meta["speed"]["cpu"]:
|
||||
speed = _get_total_speed([final_meta["speed"]["cpu"], meta["speed"]["cpu"]])
|
||||
final_meta["speed"]["cpu"] = speed
|
||||
if final_meta["speed"]["gpu"] and meta["speed"]["gpu"]:
|
||||
speed = _get_total_speed([final_meta["speed"]["gpu"], meta["speed"]["gpu"]])
|
||||
final_meta["speed"]["gpu"] = speed
|
||||
# if there were no speeds to update, overwrite with meta
|
||||
if final_meta["speed"]["cpu"] is None and final_meta["speed"]["gpu"] is None:
|
||||
final_meta["speed"].update(meta["speed"])
|
||||
# note: beam speeds are not combined with the base model
|
||||
if has_beam_widths:
|
||||
final_meta.setdefault("beam_accuracy", {})
|
||||
final_meta["beam_accuracy"].update(meta.get("beam_accuracy", {}))
|
||||
final_meta.setdefault("beam_speed", {})
|
||||
final_meta["beam_speed"].update(meta.get("beam_speed", {}))
|
||||
srsly.write_json(meta_loc, final_meta)
|
||||
msg.good("Saved model to output directory", final_model_path)
|
||||
with msg.loading("Creating best model..."):
|
||||
best_model_path = _collate_best_model(final_meta, output_path, best_pipes)
|
||||
|
@ -641,11 +672,11 @@ def _get_metrics(component):
|
|||
if component == "parser":
|
||||
return ("las", "uas", "las_per_type", "token_acc")
|
||||
elif component == "tagger":
|
||||
return ("tags_acc",)
|
||||
return ("tags_acc", "token_acc")
|
||||
elif component == "ner":
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type")
|
||||
return ("ents_f", "ents_p", "ents_r", "ents_per_type", "token_acc")
|
||||
elif component == "textcat":
|
||||
return ("textcat_score",)
|
||||
return ("textcat_score", "token_acc")
|
||||
return ("token_acc",)
|
||||
|
||||
|
||||
|
@ -701,3 +732,12 @@ def _get_progress(
|
|||
if beam_width is not None:
|
||||
result.insert(1, beam_width)
|
||||
return result
|
||||
|
||||
|
||||
def _get_total_speed(speeds):
|
||||
seconds_per_word = 0.0
|
||||
for words_per_second in speeds:
|
||||
if words_per_second is None:
|
||||
return None
|
||||
seconds_per_word += 1.0 / words_per_second
|
||||
return 1.0 / seconds_per_word
|
||||
|
|
|
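`_get_total_speed` combines words-per-second figures by adding the implied seconds per word and inverting the sum, so the result behaves like the combined throughput over both runs rather than an arithmetic mean. A quick sanity check of that arithmetic (the numbers are arbitrary):

# combining a base-model speed of 2000 words/s with a new run at 1000 words/s
combined = 1.0 / (1.0 / 2000 + 1.0 / 1000)
assert round(combined, 1) == 666.7  # not the arithmetic mean of 1500
# as in _get_total_speed, a missing measurement (None) would propagate as None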
@ -107,6 +107,9 @@ class Warnings(object):
|
|||
W027 = ("Found a large training file of {size} bytes. Note that it may "
|
||||
"be more efficient to split your training data into multiple "
|
||||
"smaller JSON files instead.")
|
||||
W028 = ("Doc.from_array was called with a vector of type '{type}', "
|
||||
"but is expecting one of type 'uint64' instead. This may result "
|
||||
"in problems with the vocab further on in the pipeline.")
|
||||
|
||||
|
||||
|
||||
|
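W028 fires when the array passed to `Doc.from_array` does not hold the 64-bit unsigned hashes produced by the string store. A minimal sketch of the intended round trip, loosely following the existing `test_issue2203` pattern (words and tags are arbitrary):

import numpy
from spacy.lang.en import English
from spacy.tokens import Doc

vocab = English().vocab
words = ["A", "short", "example"]
tags = ["DT", "JJ", "NN"]
# hash the strings first; these uint64 IDs are what from_array expects
tag_ids = [vocab.strings.add(tag) for tag in tags]

doc = Doc(vocab, words=words)
# building the array with dtype="uint64" avoids W028
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
assert [t.tag_ for t in doc] == tags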
@ -541,6 +544,14 @@ class Errors(object):
|
|||
E188 = ("Could not match the gold entity links to entities in the doc - "
|
||||
"make sure the gold EL data refers to valid results of the "
|
||||
"named entity recognizer in the `nlp` pipeline.")
|
||||
E189 = ("Each argument to `get_doc` should be of equal length.")
|
||||
E190 = ("Token head out of range in `Doc.from_array()` for token index "
|
||||
"'{index}' with value '{value}' (equivalent to relative head "
|
||||
"index: '{rel_head_index}'). The head indices should be relative "
|
||||
"to the current token index rather than absolute indices in the "
|
||||
"array.")
|
||||
E191 = ("Invalid head: the head token must be from the same doc as the "
|
||||
"token itself.")
|
||||
|
||||
|
||||
@add_codes
|
||||
|
|
|
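E190 spells out that the HEAD column handled by `Doc.from_array` stores offsets relative to each token (head index minus token index), not absolute positions. A short sketch of the valid case, mirroring the bounds test added later in this commit:

from spacy.lang.en import English
from spacy.tokens import Doc

vocab = English().vocab
words = ["This", "is", "a", "sentence", "."]
doc = Doc(vocab, words=words)
for token in doc:
    token.head = doc[0]           # attach everything to the first token

arr = doc.to_array(["HEAD"])      # each cell is head.i - token.i, a relative offset
doc2 = Doc(vocab, words=words)
doc2.from_array(["HEAD"], arr)    # fine: every offset resolves inside the doc
assert all(t.head.i == 0 for t in doc2)
# an offset that resolves outside the doc (e.g. arr[0] = 5) now raises
# ValueError with E190 instead of silently producing a corrupt parse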
@ -151,6 +151,8 @@ def align(tokens_a, tokens_b):
|
|||
cost = 0
|
||||
a2b = numpy.empty(len(tokens_a), dtype="i")
|
||||
b2a = numpy.empty(len(tokens_b), dtype="i")
|
||||
a2b.fill(-1)
|
||||
b2a.fill(-1)
|
||||
a2b_multi = {}
|
||||
b2a_multi = {}
|
||||
i = 0
|
||||
|
@ -160,7 +162,6 @@ def align(tokens_a, tokens_b):
|
|||
while i < len(tokens_a) and j < len(tokens_b):
|
||||
a = tokens_a[i][offset_a:]
|
||||
b = tokens_b[j][offset_b:]
|
||||
a2b[i] = b2a[j] = -1
|
||||
if a == b:
|
||||
if offset_a == offset_b == 0:
|
||||
a2b[i] = j
|
||||
|
|
|
@ -3,6 +3,7 @@ from __future__ import unicode_literals
|
|||
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from .norm_exceptions import NORM_EXCEPTIONS
|
||||
from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_SUFFIXES
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
from .stop_words import STOP_WORDS
|
||||
|
@ -22,6 +23,8 @@ class GermanDefaults(Language.Defaults):
|
|||
Language.Defaults.lex_attr_getters[NORM], NORM_EXCEPTIONS, BASE_NORMS
|
||||
)
|
||||
tokenizer_exceptions = update_exc(BASE_EXCEPTIONS, TOKENIZER_EXCEPTIONS)
|
||||
prefixes = TOKENIZER_PREFIXES
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
infixes = TOKENIZER_INFIXES
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
|
|
|
@ -1,10 +1,32 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS
|
||||
from ..char_classes import LIST_ELLIPSES, LIST_ICONS, LIST_PUNCT, LIST_QUOTES
|
||||
from ..char_classes import LIST_CURRENCY, CURRENCY, UNITS, PUNCT
|
||||
from ..char_classes import CONCAT_QUOTES, ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
||||
from ..punctuation import _prefixes, _suffixes
|
||||
|
||||
|
||||
_prefixes = ["``",] + list(_prefixes)
|
||||
|
||||
_suffixes = (
|
||||
["''", "/"]
|
||||
+ LIST_PUNCT
|
||||
+ LIST_ELLIPSES
|
||||
+ LIST_QUOTES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[0-9])\+",
|
||||
r"(?<=°[FfCcKk])\.",
|
||||
r"(?<=[0-9])(?:{c})".format(c=CURRENCY),
|
||||
r"(?<=[0-9])(?:{u})".format(u=UNITS),
|
||||
r"(?<=[{al}{e}{p}(?:{q})])\.".format(
|
||||
al=ALPHA_LOWER, e=r"%²\-\+", q=CONCAT_QUOTES, p=PUNCT
|
||||
),
|
||||
r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),
|
||||
]
|
||||
)
|
||||
|
||||
_quotes = CONCAT_QUOTES.replace("'", "")
|
||||
|
||||
_infixes = (
|
||||
|
@ -15,6 +37,7 @@ _infixes = (
|
|||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9{a}])\/(?=[0-9{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])([{q}\)\]\(\[])(?=[{a}])".format(a=ALPHA, q=_quotes),
|
||||
r"(?<=[{a}])--(?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[0-9])-(?=[0-9])",
|
||||
|
@ -22,4 +45,6 @@ _infixes = (
|
|||
)
|
||||
|
||||
|
||||
TOKENIZER_PREFIXES = _prefixes
|
||||
TOKENIZER_SUFFIXES = _suffixes
|
||||
TOKENIZER_INFIXES = _infixes
|
||||
|
|
|
@ -160,6 +160,8 @@ for exc_data in [
|
|||
|
||||
|
||||
for orth in [
|
||||
"``",
|
||||
"''",
|
||||
"A.C.",
|
||||
"a.D.",
|
||||
"A.D.",
|
||||
|
@ -175,10 +177,13 @@ for orth in [
|
|||
"biol.",
|
||||
"Biol.",
|
||||
"ca.",
|
||||
"CDU/CSU",
|
||||
"Chr.",
|
||||
"Cie.",
|
||||
"c/o",
|
||||
"co.",
|
||||
"Co.",
|
||||
"d'",
|
||||
"D.C.",
|
||||
"Dipl.-Ing.",
|
||||
"Dipl.",
|
||||
|
@ -203,12 +208,18 @@ for orth in [
|
|||
"i.G.",
|
||||
"i.Tr.",
|
||||
"i.V.",
|
||||
"I.",
|
||||
"II.",
|
||||
"III.",
|
||||
"IV.",
|
||||
"Inc.",
|
||||
"Ing.",
|
||||
"jr.",
|
||||
"Jr.",
|
||||
"jun.",
|
||||
"jur.",
|
||||
"K.O.",
|
||||
"L'",
|
||||
"L.A.",
|
||||
"lat.",
|
||||
"M.A.",
|
||||
|
|
30
spacy/lang/eu/__init__.py
Normal file
|
@ -0,0 +1,30 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from .stop_words import STOP_WORDS
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from .punctuation import TOKENIZER_SUFFIXES
|
||||
from .tag_map import TAG_MAP
|
||||
|
||||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...language import Language
|
||||
from ...attrs import LANG
|
||||
|
||||
|
||||
class BasqueDefaults(Language.Defaults):
|
||||
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
|
||||
lex_attr_getters.update(LEX_ATTRS)
|
||||
lex_attr_getters[LANG] = lambda text: "eu"
|
||||
|
||||
tokenizer_exceptions = BASE_EXCEPTIONS
|
||||
tag_map = TAG_MAP
|
||||
stop_words = STOP_WORDS
|
||||
suffixes = TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
class Basque(Language):
|
||||
lang = "eu"
|
||||
Defaults = BasqueDefaults
|
||||
|
||||
|
||||
__all__ = ["Basque"]
|
14
spacy/lang/eu/examples.py
Normal file
|
@ -0,0 +1,14 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.eu.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
sentences = [
|
||||
"bilbon ko castinga egin da eta nik jakin ez zuetako inork egin al du edota parte hartu duen ezagunik ba al du",
|
||||
"gaur telebistan entzunda denok martetik gatoz hortaz martzianoak gara beno nire ustez batzuk beste batzuk baino martzianoagoak dira"
|
||||
]
|
80
spacy/lang/eu/lex_attrs.py
Normal file
|
@ -0,0 +1,80 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
# Source http://mylanguages.org/basque_numbers.php
|
||||
|
||||
|
||||
_num_words = """
|
||||
bat
|
||||
bi
|
||||
hiru
|
||||
lau
|
||||
bost
|
||||
sei
|
||||
zazpi
|
||||
zortzi
|
||||
bederatzi
|
||||
hamar
|
||||
hamaika
|
||||
hamabi
|
||||
hamahiru
|
||||
hamalau
|
||||
hamabost
|
||||
hamasei
|
||||
hamazazpi
|
||||
Hemezortzi
|
||||
hemeretzi
|
||||
hogei
|
||||
ehun
|
||||
mila
|
||||
milioi
|
||||
""".split()
|
||||
|
||||
# source https://www.google.com/intl/ur/inputtools/try/
|
||||
|
||||
_ordinal_words = """
|
||||
lehen
|
||||
bigarren
|
||||
hirugarren
|
||||
laugarren
|
||||
bosgarren
|
||||
seigarren
|
||||
zazpigarren
|
||||
zortzigarren
|
||||
bederatzigarren
|
||||
hamargarren
|
||||
hamaikagarren
|
||||
hamabigarren
|
||||
hamahirugarren
|
||||
hamalaugarren
|
||||
hamabosgarren
|
||||
hamaseigarren
|
||||
hamazazpigarren
|
||||
hamazortzigarren
|
||||
hemeretzigarren
|
||||
hogeigarren
|
||||
behin
|
||||
""".split()
|
||||
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if text in _num_words:
|
||||
return True
|
||||
if text in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
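With the Basque defaults in place, `like_num` covers digit strings, simple fractions, and the number and ordinal words listed above. A quick sketch of how this surfaces on tokens, assuming a spaCy build that includes this commit:

import spacy

nlp = spacy.blank("eu")                     # uses the new BasqueDefaults
doc = nlp("hamar etxe 10 km")
# "hamar" (ten) and "10" are number-like, the other two tokens are not
assert [t.like_num for t in doc] == [True, False, True, False]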
7
spacy/lang/eu/punctuation.py
Normal file
|
@ -0,0 +1,7 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ..punctuation import TOKENIZER_SUFFIXES
|
||||
|
||||
|
||||
_suffixes = TOKENIZER_SUFFIXES
|
108
spacy/lang/eu/stop_words.py
Normal file
|
@ -0,0 +1,108 @@
|
|||
# encoding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
# Source: https://github.com/stopwords-iso/stopwords-eu
|
||||
# https://www.ranks.nl/stopwords/basque
|
||||
# https://www.mustgo.com/worldlanguages/basque/
|
||||
STOP_WORDS = set(
|
||||
"""
|
||||
al
|
||||
anitz
|
||||
arabera
|
||||
asko
|
||||
baina
|
||||
bat
|
||||
batean
|
||||
batek
|
||||
bati
|
||||
batzuei
|
||||
batzuek
|
||||
batzuetan
|
||||
batzuk
|
||||
bera
|
||||
beraiek
|
||||
berau
|
||||
berauek
|
||||
bere
|
||||
berori
|
||||
beroriek
|
||||
beste
|
||||
bezala
|
||||
da
|
||||
dago
|
||||
dira
|
||||
ditu
|
||||
du
|
||||
dute
|
||||
edo
|
||||
egin
|
||||
ere
|
||||
eta
|
||||
eurak
|
||||
ez
|
||||
gainera
|
||||
gu
|
||||
gutxi
|
||||
guzti
|
||||
haiei
|
||||
haiek
|
||||
haietan
|
||||
hainbeste
|
||||
hala
|
||||
han
|
||||
handik
|
||||
hango
|
||||
hara
|
||||
hari
|
||||
hark
|
||||
hartan
|
||||
hau
|
||||
hauei
|
||||
hauek
|
||||
hauetan
|
||||
hemen
|
||||
hemendik
|
||||
hemengo
|
||||
hi
|
||||
hona
|
||||
honek
|
||||
honela
|
||||
honetan
|
||||
honi
|
||||
hor
|
||||
hori
|
||||
horiei
|
||||
horiek
|
||||
horietan
|
||||
horko
|
||||
horra
|
||||
horrek
|
||||
horrela
|
||||
horretan
|
||||
horri
|
||||
hortik
|
||||
hura
|
||||
izan
|
||||
ni
|
||||
noiz
|
||||
nola
|
||||
non
|
||||
nondik
|
||||
nongo
|
||||
nor
|
||||
nora
|
||||
ze
|
||||
zein
|
||||
zen
|
||||
zenbait
|
||||
zenbat
|
||||
zer
|
||||
zergatik
|
||||
ziren
|
||||
zituen
|
||||
zu
|
||||
zuek
|
||||
zuen
|
||||
zuten
|
||||
""".split()
|
||||
)
|
71
spacy/lang/eu/tag_map.py
Normal file
|
@ -0,0 +1,71 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...symbols import POS, PUNCT, SYM, ADJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
|
||||
from ...symbols import NOUN, PROPN, PART, INTJ, SPACE, PRON
|
||||
|
||||
TAG_MAP = {
|
||||
".": {POS: PUNCT, "PunctType": "peri"},
|
||||
",": {POS: PUNCT, "PunctType": "comm"},
|
||||
"-LRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "ini"},
|
||||
"-RRB-": {POS: PUNCT, "PunctType": "brck", "PunctSide": "fin"},
|
||||
"``": {POS: PUNCT, "PunctType": "quot", "PunctSide": "ini"},
|
||||
'""': {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
"''": {POS: PUNCT, "PunctType": "quot", "PunctSide": "fin"},
|
||||
":": {POS: PUNCT},
|
||||
"$": {POS: SYM, "Other": {"SymType": "currency"}},
|
||||
"#": {POS: SYM, "Other": {"SymType": "numbersign"}},
|
||||
"AFX": {POS: ADJ, "Hyph": "yes"},
|
||||
"CC": {POS: CCONJ, "ConjType": "coor"},
|
||||
"CD": {POS: NUM, "NumType": "card"},
|
||||
"DT": {POS: DET},
|
||||
"EX": {POS: ADV, "AdvType": "ex"},
|
||||
"FW": {POS: X, "Foreign": "yes"},
|
||||
"HYPH": {POS: PUNCT, "PunctType": "dash"},
|
||||
"IN": {POS: ADP},
|
||||
"JJ": {POS: ADJ, "Degree": "pos"},
|
||||
"JJR": {POS: ADJ, "Degree": "comp"},
|
||||
"JJS": {POS: ADJ, "Degree": "sup"},
|
||||
"LS": {POS: PUNCT, "NumType": "ord"},
|
||||
"MD": {POS: VERB, "VerbType": "mod"},
|
||||
"NIL": {POS: ""},
|
||||
"NN": {POS: NOUN, "Number": "sing"},
|
||||
"NNP": {POS: PROPN, "NounType": "prop", "Number": "sing"},
|
||||
"NNPS": {POS: PROPN, "NounType": "prop", "Number": "plur"},
|
||||
"NNS": {POS: NOUN, "Number": "plur"},
|
||||
"PDT": {POS: ADJ, "AdjType": "pdt", "PronType": "prn"},
|
||||
"POS": {POS: PART, "Poss": "yes"},
|
||||
"PRP": {POS: PRON, "PronType": "prs"},
|
||||
"PRP$": {POS: ADJ, "PronType": "prs", "Poss": "yes"},
|
||||
"RB": {POS: ADV, "Degree": "pos"},
|
||||
"RBR": {POS: ADV, "Degree": "comp"},
|
||||
"RBS": {POS: ADV, "Degree": "sup"},
|
||||
"RP": {POS: PART},
|
||||
"SP": {POS: SPACE},
|
||||
"SYM": {POS: SYM},
|
||||
"TO": {POS: PART, "PartType": "inf", "VerbForm": "inf"},
|
||||
"UH": {POS: INTJ},
|
||||
"VB": {POS: VERB, "VerbForm": "inf"},
|
||||
"VBD": {POS: VERB, "VerbForm": "fin", "Tense": "past"},
|
||||
"VBG": {POS: VERB, "VerbForm": "part", "Tense": "pres", "Aspect": "prog"},
|
||||
"VBN": {POS: VERB, "VerbForm": "part", "Tense": "past", "Aspect": "perf"},
|
||||
"VBP": {POS: VERB, "VerbForm": "fin", "Tense": "pres"},
|
||||
"VBZ": {
|
||||
POS: VERB,
|
||||
"VerbForm": "fin",
|
||||
"Tense": "pres",
|
||||
"Number": "sing",
|
||||
"Person": 3,
|
||||
},
|
||||
"WDT": {POS: ADJ, "PronType": "int|rel"},
|
||||
"WP": {POS: NOUN, "PronType": "int|rel"},
|
||||
"WP$": {POS: ADJ, "Poss": "yes", "PronType": "int|rel"},
|
||||
"WRB": {POS: ADV, "PronType": "int|rel"},
|
||||
"ADD": {POS: X},
|
||||
"NFP": {POS: PUNCT},
|
||||
"GW": {POS: X},
|
||||
"XX": {POS: X},
|
||||
"BES": {POS: VERB},
|
||||
"HVS": {POS: VERB},
|
||||
"_SP": {POS: SPACE},
|
||||
}
|
|
@ -5,11 +5,13 @@ from ..char_classes import LIST_ELLIPSES, LIST_ICONS, ALPHA, ALPHA_LOWER, ALPHA_
|
|||
|
||||
ELISION = " ' ’ ".strip().replace(" ", "")
|
||||
|
||||
abbrev = ("d", "D")
|
||||
|
||||
_infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
+ [
|
||||
r"(?<=[{a}][{el}])(?=[{a}])".format(a=ALPHA, el=ELISION),
|
||||
r"(?<=^[{ab}][{el}])(?=[{a}])".format(ab=abbrev, a=ALPHA, el=ELISION),
|
||||
r"(?<=[{al}])\.(?=[{au}])".format(al=ALPHA_LOWER, au=ALPHA_UPPER),
|
||||
r"(?<=[{a}])[,!?](?=[{a}])".format(a=ALPHA),
|
||||
r"(?<=[{a}])[:<>=](?=[{a}])".format(a=ALPHA),
|
||||
|
|
|
@ -10,6 +10,8 @@ _exc = {}
|
|||
|
||||
# translate / delete what is not necessary
|
||||
for exc_data in [
|
||||
{ORTH: "’t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "’T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'t", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "'T", LEMMA: "et", NORM: "et"},
|
||||
{ORTH: "wgl.", LEMMA: "wannechgelift", NORM: "wannechgelift"},
|
||||
|
|
|
@ -3,6 +3,9 @@ from __future__ import absolute_import, unicode_literals
|
|||
|
||||
import random
|
||||
import itertools
|
||||
|
||||
from thinc.extra import load_nlp
|
||||
|
||||
from spacy.util import minibatch
|
||||
import weakref
|
||||
import functools
|
||||
|
@ -15,6 +18,7 @@ import multiprocessing as mp
|
|||
from itertools import chain, cycle
|
||||
|
||||
from .tokenizer import Tokenizer
|
||||
from .tokens.underscore import Underscore
|
||||
from .vocab import Vocab
|
||||
from .lemmatizer import Lemmatizer
|
||||
from .lookups import Lookups
|
||||
|
@ -753,8 +757,6 @@ class Language(object):
|
|||
|
||||
DOCS: https://spacy.io/api/language#pipe
|
||||
"""
|
||||
# raw_texts will be used later to stop iterator.
|
||||
texts, raw_texts = itertools.tee(texts)
|
||||
if is_python2 and n_process != 1:
|
||||
user_warning(Warnings.W023)
|
||||
n_process = 1
|
||||
|
@ -853,7 +855,10 @@ class Language(object):
|
|||
sender.send()
|
||||
|
||||
procs = [
|
||||
mp.Process(target=_apply_pipes, args=(self.make_doc, pipes, rch, sch))
|
||||
mp.Process(
|
||||
target=_apply_pipes,
|
||||
args=(self.make_doc, pipes, rch, sch, Underscore.get_state(), load_nlp.VECTORS),
|
||||
)
|
||||
for rch, sch in zip(texts_q, bytedocs_send_ch)
|
||||
]
|
||||
for proc in procs:
|
||||
|
@ -1108,16 +1113,20 @@ def _pipe(docs, proc, kwargs):
|
|||
yield doc
|
||||
|
||||
|
||||
def _apply_pipes(make_doc, pipes, reciever, sender):
|
||||
def _apply_pipes(make_doc, pipes, receiver, sender, underscore_state, vectors):
|
||||
"""Worker for Language.pipe
|
||||
|
||||
receiver (multiprocessing.Connection): Pipe to receive text. Usually
|
||||
created by `multiprocessing.Pipe()`
|
||||
sender (multiprocessing.Connection): Pipe to send doc. Usually created by
|
||||
`multiprocessing.Pipe()`
|
||||
underscore_state (tuple): The data in the Underscore class of the parent
|
||||
vectors (dict): The global vectors data, copied from the parent
|
||||
"""
|
||||
Underscore.load_state(underscore_state)
|
||||
load_nlp.VECTORS = vectors
|
||||
while True:
|
||||
texts = reciever.get()
|
||||
texts = receiver.get()
|
||||
docs = (make_doc(text) for text in texts)
|
||||
for pipe in pipes:
|
||||
docs = pipe(docs)
|
||||
|
|
|
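The extra `underscore_state` and `vectors` arguments exist because worker processes started by `Language.pipe(..., n_process=...)` do not automatically see the parent's custom `Doc._` extensions or the module-level vectors cache in `thinc.extra.load_nlp`, so both are now copied over explicitly (this is what the new regression tests for #4903 and #4725 exercise). A minimal sketch of the usage this enables; the extension name and texts are illustrative:

from spacy.lang.en import English
from spacy.tokens import Doc

Doc.set_extension("source", default="example", force=True)

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
texts = ["I like bananas.", "Do you like them?"] * 5

if __name__ == "__main__":
    # each worker now receives the Underscore state and the vectors cache
    for doc in nlp.pipe(texts, n_process=2):
        assert doc._.source == "example"

The `__main__` guard matters because the workers are plain `multiprocessing.Process` instances, which re-import the calling module on platforms that spawn rather than fork.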
@ -170,6 +170,10 @@ TOKEN_PATTERN_SCHEMA = {
|
|||
"title": "Token is the first in a sentence",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
},
|
||||
"SENT_START": {
|
||||
"title": "Token is the first in a sentence",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
},
|
||||
"LIKE_NUM": {
|
||||
"title": "Token resembles a number",
|
||||
"$ref": "#/definitions/boolean_value",
|
||||
|
|
|
@ -670,6 +670,8 @@ def _get_attr_values(spec, string_store):
|
|||
continue
|
||||
if attr == "TEXT":
|
||||
attr = "ORTH"
|
||||
if attr == "IS_SENT_START":
|
||||
attr = "SENT_START"
|
||||
if attr not in TOKEN_PATTERN_SCHEMA["items"]["properties"]:
|
||||
raise ValueError(Errors.E152.format(attr=attr))
|
||||
attr = IDS.get(attr)
|
||||
|
|
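With `SENT_START` added to the pattern schema and `IS_SENT_START` rewritten to it here, token patterns can anchor on sentence boundaries. A small sketch of a pattern that should now validate and match on this branch (text and match key are made up):

from spacy.lang.en import English
from spacy.matcher import Matcher

nlp = English()
nlp.add_pipe(nlp.create_pipe("sentencizer"))
matcher = Matcher(nlp.vocab)
# IS_SENT_START is mapped to SENT_START internally, so either spelling works
matcher.add("SENT_INITIAL_NO", None, [{"IS_SENT_START": True, "LOWER": "no"}])

doc = nlp("Yes. No, I prefer wasabi.")
matches = matcher(doc)
assert [doc[start:end].text for _, start, end in matches] == ["No"]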
|
@ -367,7 +367,7 @@ class Tensorizer(Pipe):
|
|||
return sgd
|
||||
|
||||
|
||||
@component("tagger", assigns=["token.tag", "token.pos"])
|
||||
@component("tagger", assigns=["token.tag", "token.pos", "token.lemma"])
|
||||
class Tagger(Pipe):
|
||||
"""Pipeline component for part-of-speech tagging.
|
||||
|
||||
|
|
|
@ -606,7 +606,6 @@ cdef class Parser:
|
|||
if not hasattr(get_gold_tuples, '__call__'):
|
||||
gold_tuples = get_gold_tuples
|
||||
get_gold_tuples = lambda: gold_tuples
|
||||
cfg.setdefault('min_action_freq', 30)
|
||||
actions = self.moves.get_actions(gold_parses=get_gold_tuples(),
|
||||
min_freq=cfg.get('min_action_freq', 30),
|
||||
learn_tokens=self.cfg.get("learn_tokens", False))
|
||||
|
@ -616,8 +615,9 @@ cdef class Parser:
|
|||
if label not in actions[action]:
|
||||
actions[action][label] = freq
|
||||
self.moves.initialize_actions(actions)
|
||||
cfg.setdefault('token_vector_width', 96)
|
||||
if self.model is True:
|
||||
cfg.setdefault('min_action_freq', 30)
|
||||
cfg.setdefault('token_vector_width', 96)
|
||||
self.model, cfg = self.Model(self.moves.n_moves, **cfg)
|
||||
if sgd is None:
|
||||
sgd = self.create_optimizer()
|
||||
|
@ -633,11 +633,11 @@ cdef class Parser:
|
|||
if pipeline is not None:
|
||||
self.init_multitask_objectives(get_gold_tuples, pipeline, sgd=sgd, **cfg)
|
||||
link_vectors_to_models(self.vocab)
|
||||
self.cfg.update(cfg)
|
||||
else:
|
||||
if sgd is None:
|
||||
sgd = self.create_optimizer()
|
||||
self.model.begin_training([])
|
||||
self.cfg.update(cfg)
|
||||
return sgd
|
||||
|
||||
def to_disk(self, path, exclude=tuple(), **kwargs):
|
||||
|
|
|
@ -83,6 +83,11 @@ def es_tokenizer():
|
|||
return get_lang_class("es").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def eu_tokenizer():
|
||||
return get_lang_class("eu").Defaults.create_tokenizer()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def fi_tokenizer():
|
||||
return get_lang_class("fi").Defaults.create_tokenizer()
|
||||
|
|
|
@ -77,3 +77,30 @@ def test_doc_array_idx(en_vocab):
|
|||
assert offsets[0] == 0
|
||||
assert offsets[1] == 3
|
||||
assert offsets[2] == 11
|
||||
|
||||
|
||||
def test_doc_from_array_heads_in_bounds(en_vocab):
|
||||
"""Test that Doc.from_array doesn't set heads that are out of bounds."""
|
||||
words = ["This", "is", "a", "sentence", "."]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for token in doc:
|
||||
token.head = doc[0]
|
||||
|
||||
# correct
|
||||
arr = doc.to_array(["HEAD"])
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head before start
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = -1
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
||||
# head after end
|
||||
arr = doc.to_array(["HEAD"])
|
||||
arr[0] = 5
|
||||
doc_from_array = Doc(en_vocab, words=words)
|
||||
with pytest.raises(ValueError):
|
||||
doc_from_array.from_array(["HEAD"], arr)
|
||||
|
|
|
@ -150,10 +150,9 @@ def test_doc_api_runtime_error(en_tokenizer):
|
|||
# Example that caused run-time error while parsing Reddit
|
||||
# fmt: off
|
||||
text = "67% of black households are single parent \n\n72% of all black babies born out of wedlock \n\n50% of all black kids don\u2019t finish high school"
|
||||
deps = ["nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "",
|
||||
"nummod", "prep", "det", "amod", "pobj", "acl", "prep", "prep",
|
||||
"pobj", "", "nummod", "prep", "det", "amod", "pobj", "aux", "neg",
|
||||
"ROOT", "amod", "dobj"]
|
||||
deps = ["nummod", "nsubj", "prep", "amod", "pobj", "ROOT", "amod", "attr", "", "nummod", "appos", "prep", "det",
|
||||
"amod", "pobj", "acl", "prep", "prep", "pobj",
|
||||
"", "nummod", "nsubj", "prep", "det", "amod", "pobj", "aux", "neg", "ccomp", "amod", "dobj"]
|
||||
# fmt: on
|
||||
tokens = en_tokenizer(text)
|
||||
doc = get_doc(tokens.vocab, words=[t.text for t in tokens], deps=deps)
|
||||
|
@ -277,7 +276,9 @@ def test_doc_is_nered(en_vocab):
|
|||
def test_doc_from_array_sent_starts(en_vocab):
|
||||
words = ["I", "live", "in", "New", "York", ".", "I", "like", "cats", "."]
|
||||
heads = [0, 0, 0, 0, 0, 0, 6, 6, 6, 6]
|
||||
# fmt: off
|
||||
deps = ["ROOT", "dep", "dep", "dep", "dep", "dep", "ROOT", "dep", "dep", "dep", "dep"]
|
||||
# fmt: on
|
||||
doc = Doc(en_vocab, words=words)
|
||||
for i, (dep, head) in enumerate(zip(deps, heads)):
|
||||
doc[i].dep_ = dep
|
||||
|
|
|
@ -167,6 +167,11 @@ def test_doc_token_api_head_setter(en_tokenizer):
|
|||
assert doc[4].left_edge.i == 0
|
||||
assert doc[2].left_edge.i == 0
|
||||
|
||||
# head token must be from the same document
|
||||
doc2 = get_doc(tokens.vocab, words=[t.text for t in tokens], heads=heads)
|
||||
with pytest.raises(ValueError):
|
||||
doc[0].head = doc2[0]
|
||||
|
||||
|
||||
def test_is_sent_start(en_tokenizer):
|
||||
doc = en_tokenizer("This is a sentence. This is another.")
|
||||
|
@ -214,7 +219,7 @@ def test_token_api_conjuncts_chain(en_vocab):
|
|||
def test_token_api_conjuncts_simple(en_vocab):
|
||||
words = "They came and went .".split()
|
||||
heads = [1, 0, -1, -2, -1]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj"]
|
||||
deps = ["nsubj", "ROOT", "cc", "conj", "dep"]
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert [w.text for w in doc[1].conjuncts] == ["went"]
|
||||
assert [w.text for w in doc[3].conjuncts] == ["came"]
|
||||
|
|
|
@ -7,6 +7,15 @@ from spacy.tokens import Doc, Span, Token
|
|||
from spacy.tokens.underscore import Underscore
|
||||
|
||||
|
||||
@pytest.fixture(scope="function", autouse=True)
|
||||
def clean_underscore():
|
||||
# reset the Underscore object after the test, to avoid having state copied across tests
|
||||
yield
|
||||
Underscore.doc_extensions = {}
|
||||
Underscore.span_extensions = {}
|
||||
Underscore.token_extensions = {}
|
||||
|
||||
|
||||
def test_create_doc_underscore():
|
||||
doc = Mock()
|
||||
doc.doc = doc
|
||||
|
|
16
spacy/tests/lang/eu/test_text.py
Normal file
|
@ -0,0 +1,16 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
def test_eu_tokenizer_handles_long_text(eu_tokenizer):
|
||||
text = """ta nere guitarra estrenatu ondoren"""
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == 5
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,length", [("milesker ederra joan zen hitzaldia plazer hutsa", 7), ("astelehen guztia sofan pasau biot", 5)])
|
||||
def test_eu_tokenizer_handles_cnts(eu_tokenizer, text, length):
|
||||
tokens = eu_tokenizer(text)
|
||||
assert len(tokens) == length
|
|
@ -6,6 +6,7 @@ import re
|
|||
from mock import Mock
|
||||
from spacy.matcher import Matcher, DependencyMatcher
|
||||
from spacy.tokens import Doc, Token
|
||||
from ..doc.test_underscore import clean_underscore
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
|
@ -200,6 +201,7 @@ def test_matcher_any_token_operator(en_vocab):
|
|||
assert matches[2] == "test hello world"
|
||||
|
||||
|
||||
@pytest.mark.usefixtures("clean_underscore")
|
||||
def test_matcher_extension_attribute(en_vocab):
|
||||
matcher = Matcher(en_vocab)
|
||||
get_is_fruit = lambda token: token.text in ("apple", "banana")
|
||||
|
|
|
@ -34,6 +34,8 @@ TEST_PATTERNS = [
|
|||
([{"LOWER": {"REGEX": "^X", "NOT_IN": ["XXX", "XY"]}}], 0, 0),
|
||||
([{"NORM": "a"}, {"POS": {"IN": ["NOUN"]}}], 0, 0),
|
||||
([{"_": {"foo": {"NOT_IN": ["bar", "baz"]}, "a": 5, "b": {">": 10}}}], 0, 0),
|
||||
([{"IS_SENT_START": True}], 0, 0),
|
||||
([{"SENT_START": True}], 0, 0),
|
||||
]
|
||||
|
||||
XFAIL_TEST_PATTERNS = [([{"orth": "foo"}], 0, 0)]
|
||||
|
|
|
@ -34,23 +34,23 @@ BIG BROTHER IS WATCHING YOU, the caption beneath it ran.
|
|||
@pytest.fixture
|
||||
def heads():
|
||||
# fmt: off
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, -10, 2, 1, -3, -1, -15,
|
||||
-1, 1, 4, -1, 1, -3, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, 3, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 2, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 0, 1, -2, -1, -1, 3, -1, 1, -6,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, -10, -1, 1, -2, -22, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, -2, -1, 1, -2, -6, -7, 1, -9, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -19, -24, 3, 2, 1, -4, -1, 1, 2, -1, -5, -34, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, -3, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 5, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -15, 1, 0, -1, -1, 2, 1, -3, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -2, -3, 9, -1, 2, -1, -4, 2, 1, -3, -4, -15, 2,
|
||||
1, -3, -1, 2, 1, -3, -8, -9, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-19, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
return [1, 1, 0, 3, 2, 1, -4, -1, -1, -7, -8, 1, 2, 1, -12, -1, -2,
|
||||
-1, 1, 4, 3, 1, 1, 0, -1, 1, -2, -4, 1, -2, 1, -2, 3, -1, 1,
|
||||
-4, -13, -14, -1, -2, 2, 1, -3, -1, 1, -2, -9, -1, -11, 1, 1, -14,
|
||||
1, -2, 1, -2, -1, 1, -2, -6, -1, -1, -2, -1, -1, -42, -1, 1, 1,
|
||||
0, -1, 1, -2, -1, 2, 1, -4, -8, 18, 1, -2, -1, -1, 3, -1, 1, 10,
|
||||
9, 1, 7, -1, 1, -2, 3, 2, 1, 0, -1, 1, -2, -4, -1, 1, 0, -1,
|
||||
2, 1, -4, -1, 2, 1, 1, 1, -6, -11, 1, 20, -1, 2, -1, -3, -1,
|
||||
3, 2, 1, -4, -10, -11, 3, 2, 1, -4, -1, 1, -3, -1, 0, -1, 1, 0,
|
||||
-1, 1, -2, -4, 1, 0, 1, -2, -1, 1, -2, -6, 1, 9, -1, 1, 6, -1,
|
||||
-1, 3, 2, 1, 0, -1, -2, 7, -1, 2, 1, 3, -1, 1, -10, -1, -2, 1,
|
||||
-2, -5, 1, 0, -1, -1, 1, -2, -5, -1, -1, -2, -1, 1, -2, -12, 1,
|
||||
1, 0, 1, -2, -1, -4, -5, 18, -1, 2, -1, -4, 2, 1, -3, -4, -5, 2,
|
||||
1, -3, -1, 2, 1, -3, -17, -24, -1, -2, -1, -4, 1, -2, -3, 1, -2,
|
||||
-10, 17, 1, -2, 14, 13, 3, 2, 1, -4, 8, -1, 1, 5, -1, 2, 1, -3,
|
||||
0, -1, 1, -2, -4, 1, 0, -1, -1, 2, -1, -3, 1, -2, 1, -2, 3, 1,
|
||||
1, -4, -1, -2, 2, 1, -5, -19, -1, 1, 1, 0, 1, 6, -1, 1, -3, -1,
|
||||
-1, -8, -9, -1]
|
||||
1, -4, -1, -2, 2, 1, -3, -19, -1, 1, 1, 0, 0, 6, 5, 1, 3, -1,
|
||||
-1, 0, -1, -1]
|
||||
# fmt: on
|
||||
|
||||
|
||||
|
|
|
@ -48,7 +48,7 @@ def test_issue2203(en_vocab):
|
|||
tag_ids = [en_vocab.strings.add(tag) for tag in tags]
|
||||
lemma_ids = [en_vocab.strings.add(lemma) for lemma in lemmas]
|
||||
doc = Doc(en_vocab, words=words)
|
||||
# Work around lemma corrpution problem and set lemmas after tags
|
||||
# Work around lemma corruption problem and set lemmas after tags
|
||||
doc.from_array("TAG", numpy.array(tag_ids, dtype="uint64"))
|
||||
doc.from_array("LEMMA", numpy.array(lemma_ids, dtype="uint64"))
|
||||
assert [t.tag_ for t in doc] == tags
|
||||
|
|
|
@ -124,7 +124,7 @@ def test_issue2772(en_vocab):
|
|||
words = "When we write or communicate virtually , we can hide our true feelings .".split()
|
||||
# A tree with a non-projective (i.e. crossing) arc
|
||||
# The arcs (0, 4) and (2, 9) cross.
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, -1, -2, -1]
|
||||
heads = [4, 1, 7, -1, -2, -1, 3, 2, 1, 0, 2, 1, -3, -4]
|
||||
deps = ["dep"] * len(heads)
|
||||
doc = get_doc(en_vocab, words=words, heads=heads, deps=deps)
|
||||
assert doc[1].is_sent_start is None
|
||||
|
|
|
@ -27,7 +27,7 @@ def test_issue4590(en_vocab):
|
|||
|
||||
text = "The quick brown fox jumped over the lazy fox"
|
||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||
deps = ["det", "amod", "amod", "nsubj", "ROOT", "prep", "det", "amod", "pobj"]
|
||||
|
||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||
|
||||
|
|
26
spacy/tests/regression/test_issue4725.py
Normal file
|
@ -0,0 +1,26 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.vocab import Vocab
|
||||
|
||||
|
||||
def test_issue4725():
|
||||
# ensures that this runs correctly and doesn't hang or crash because of the global vectors
|
||||
vocab = Vocab(vectors_name="test_vocab_add_vector")
|
||||
data = numpy.ndarray((5, 3), dtype="f")
|
||||
data[0] = 1.0
|
||||
data[1] = 2.0
|
||||
vocab.set_vector("cat", data[0])
|
||||
vocab.set_vector("dog", data[1])
|
||||
|
||||
nlp = English(vocab=vocab)
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
nlp.begin_training()
|
||||
docs = ["Kurt is in London."] * 10
|
||||
for _ in nlp.pipe(docs, batch_size=2, n_process=2):
|
||||
pass
|
||||
|
43
spacy/tests/regression/test_issue4903.py
Normal file
|
@ -0,0 +1,43 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span, Doc
|
||||
|
||||
|
||||
class CustomPipe:
|
||||
name = "my_pipe"
|
||||
|
||||
def __init__(self):
|
||||
Span.set_extension("my_ext", getter=self._get_my_ext)
|
||||
Doc.set_extension("my_ext", default=None)
|
||||
|
||||
def __call__(self, doc):
|
||||
gathered_ext = []
|
||||
for sent in doc.sents:
|
||||
sent_ext = self._get_my_ext(sent)
|
||||
sent._.set("my_ext", sent_ext)
|
||||
gathered_ext.append(sent_ext)
|
||||
|
||||
doc._.set("my_ext", "\n".join(gathered_ext))
|
||||
|
||||
return doc
|
||||
|
||||
@staticmethod
|
||||
def _get_my_ext(span):
|
||||
return str(span.end)
|
||||
|
||||
|
||||
def test_issue4903():
|
||||
# ensures that this runs correctly and doesn't hang or crash on Windows / macOS
|
||||
|
||||
nlp = English()
|
||||
custom_component = CustomPipe()
|
||||
nlp.add_pipe(nlp.create_pipe("sentencizer"))
|
||||
nlp.add_pipe(custom_component, after="sentencizer")
|
||||
|
||||
text = ["I like bananas.", "Do you like them?", "No, I prefer wasabi."]
|
||||
docs = list(nlp.pipe(text, n_process=2))
|
||||
assert docs[0].text == "I like bananas."
|
||||
assert docs[1].text == "Do you like them?"
|
||||
assert docs[2].text == "No, I prefer wasabi."
|
|
@ -11,6 +11,6 @@ def nlp():
|
|||
return spacy.blank("en")
|
||||
|
||||
|
||||
def test_evaluate(nlp):
|
||||
def test_issue4924(nlp):
|
||||
docs_golds = [("", {})]
|
||||
nlp.evaluate(docs_golds)
|
||||
|
|
35
spacy/tests/regression/test_issue5048.py
Normal file
|
@ -0,0 +1,35 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import numpy
|
||||
from spacy.tokens import Doc
|
||||
from spacy.attrs import DEP, POS, TAG
|
||||
|
||||
from ..util import get_doc
|
||||
|
||||
|
||||
def test_issue5048(en_vocab):
|
||||
words = ["This", "is", "a", "sentence"]
|
||||
pos_s = ["DET", "VERB", "DET", "NOUN"]
|
||||
spaces = [" ", " ", " ", ""]
|
||||
deps_s = ["dep", "adj", "nn", "atm"]
|
||||
tags_s = ["DT", "VBZ", "DT", "NN"]
|
||||
|
||||
strings = en_vocab.strings
|
||||
|
||||
for w in words:
|
||||
strings.add(w)
|
||||
deps = [strings.add(d) for d in deps_s]
|
||||
pos = [strings.add(p) for p in pos_s]
|
||||
tags = [strings.add(t) for t in tags_s]
|
||||
|
||||
attrs = [POS, DEP, TAG]
|
||||
array = numpy.array(list(zip(pos, deps, tags)), dtype="uint64")
|
||||
|
||||
doc = Doc(en_vocab, words=words, spaces=spaces)
|
||||
doc.from_array(attrs, array)
|
||||
v1 = [(token.text, token.pos_, token.tag_) for token in doc]
|
||||
|
||||
doc2 = get_doc(en_vocab, words=words, pos=pos_s, deps=deps_s, tags=tags_s)
|
||||
v2 = [(token.text, token.pos_, token.tag_) for token in doc2]
|
||||
assert v1 == v2
|
spacy/tests/regression/test_issue5082.py (new file, 46 lines)
@@ -0,0 +1,46 @@
# coding: utf8
from __future__ import unicode_literals

import numpy as np
from spacy.lang.en import English
from spacy.pipeline import EntityRuler


def test_issue5082():
    # Ensure the 'merge_entities' pipeline does something sensible for the vectors of the merged tokens
    nlp = English()
    vocab = nlp.vocab
    array1 = np.asarray([0.1, 0.5, 0.8], dtype=np.float32)
    array2 = np.asarray([-0.2, -0.6, -0.9], dtype=np.float32)
    array3 = np.asarray([0.3, -0.1, 0.7], dtype=np.float32)
    array4 = np.asarray([0.5, 0, 0.3], dtype=np.float32)
    array34 = np.asarray([0.4, -0.05, 0.5], dtype=np.float32)

    vocab.set_vector("I", array1)
    vocab.set_vector("like", array2)
    vocab.set_vector("David", array3)
    vocab.set_vector("Bowie", array4)

    text = "I like David Bowie"
    ruler = EntityRuler(nlp)
    patterns = [
        {"label": "PERSON", "pattern": [{"LOWER": "david"}, {"LOWER": "bowie"}]}
    ]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    parsed_vectors_1 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_1) == 4
    np.testing.assert_array_equal(parsed_vectors_1[0], array1)
    np.testing.assert_array_equal(parsed_vectors_1[1], array2)
    np.testing.assert_array_equal(parsed_vectors_1[2], array3)
    np.testing.assert_array_equal(parsed_vectors_1[3], array4)

    merge_ents = nlp.create_pipe("merge_entities")
    nlp.add_pipe(merge_ents)

    parsed_vectors_2 = [t.vector for t in nlp(text)]
    assert len(parsed_vectors_2) == 3
    np.testing.assert_array_equal(parsed_vectors_2[0], array1)
    np.testing.assert_array_equal(parsed_vectors_2[1], array2)
    np.testing.assert_array_equal(parsed_vectors_2[2], array34)
@@ -15,12 +15,19 @@ def load_tokenizer(b):


def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
    """Test that custom tokenizer with not all functions defined can be
    serialized and deserialized correctly (see #2494)."""
    """Test that custom tokenizer with not all functions defined or empty
    properties can be serialized and deserialized correctly (see #2494,
    #4991)."""
    tokenizer = Tokenizer(en_vocab, suffix_search=en_tokenizer.suffix_search)
    tokenizer_bytes = tokenizer.to_bytes()
    Tokenizer(en_vocab).from_bytes(tokenizer_bytes)

    tokenizer = Tokenizer(en_vocab, rules={"ABC.": [{"ORTH": "ABC"}, {"ORTH": "."}]})
    tokenizer.rules = {}
    tokenizer_bytes = tokenizer.to_bytes()
    tokenizer_reloaded = Tokenizer(en_vocab).from_bytes(tokenizer_bytes)
    assert tokenizer_reloaded.rules == {}


@pytest.mark.skip(reason="Currently unreliable across platforms")
@pytest.mark.parametrize("text", ["I💜you", "they’re", "“hello”"])
@@ -31,10 +31,10 @@ def test_displacy_parse_deps(en_vocab):
    deps = displacy.parse_deps(doc)
    assert isinstance(deps, dict)
    assert deps["words"] == [
        {"lemma": None, "text": "This", "tag": "DET"},
        {"lemma": None, "text": "is", "tag": "AUX"},
        {"lemma": None, "text": "a", "tag": "DET"},
        {"lemma": None, "text": "sentence", "tag": "NOUN"},
        {"lemma": None, "text": words[0], "tag": pos[0]},
        {"lemma": None, "text": words[1], "tag": pos[1]},
        {"lemma": None, "text": words[2], "tag": pos[2]},
        {"lemma": None, "text": words[3], "tag": pos[3]},
    ]
    assert deps["arcs"] == [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
@@ -75,7 +75,7 @@ def test_displacy_rtl():
    deps = ["foo", "bar", "foo", "baz"]
    heads = [1, 0, 1, -2]
    nlp = Persian()
    doc = get_doc(nlp.vocab, words=words, pos=pos, tags=pos, heads=heads, deps=deps)
    doc = get_doc(nlp.vocab, words=words, tags=pos, heads=heads, deps=deps)
    doc.ents = [Span(doc, 1, 3, label="TEST")]
    html = displacy.render(doc, page=True, style="dep")
    assert "direction: rtl" in html
@@ -7,9 +7,10 @@ import shutil
import contextlib
import srsly
from pathlib import Path

from spacy import Errors
from spacy.tokens import Doc, Span
from spacy.attrs import POS, HEAD, DEP
from spacy.attrs import POS, TAG, HEAD, DEP, LEMMA
from spacy.compat import path2str
@@ -26,30 +28,54 @@ def make_tempdir():
    shutil.rmtree(path2str(d))


def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None):
def get_doc(vocab, words=[], pos=None, heads=None, deps=None, tags=None, ents=None, lemmas=None):
    """Create Doc object from given vocab, words and annotations."""
    pos = pos or [""] * len(words)
    tags = tags or [""] * len(words)
    heads = heads or [0] * len(words)
    deps = deps or [""] * len(words)
    for value in deps + tags + pos:
    if deps and not heads:
        heads = [0] * len(deps)
    headings = []
    values = []
    annotations = [pos, heads, deps, lemmas, tags]
    possible_headings = [POS, HEAD, DEP, LEMMA, TAG]
    for a, annot in enumerate(annotations):
        if annot is not None:
            if len(annot) != len(words):
                raise ValueError(Errors.E189)
            headings.append(possible_headings[a])
            if annot is not heads:
                values.extend(annot)
    for value in values:
        vocab.strings.add(value)

    doc = Doc(vocab, words=words)
    attrs = doc.to_array([POS, HEAD, DEP])
    for i, (p, head, dep) in enumerate(zip(pos, heads, deps)):
        attrs[i, 0] = doc.vocab.strings[p]
        attrs[i, 1] = head
        attrs[i, 2] = doc.vocab.strings[dep]
    doc.from_array([POS, HEAD, DEP], attrs)

    # if there are any other annotations, set them
    if headings:
        attrs = doc.to_array(headings)

        j = 0
        for annot in annotations:
            if annot:
                if annot is heads:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = heads[i]
                        else:
                            attrs[i, j] = heads[i]
                else:
                    for i in range(len(words)):
                        if attrs.ndim == 1:
                            attrs[i] = doc.vocab.strings[annot[i]]
                        else:
                            attrs[i, j] = doc.vocab.strings[annot[i]]
                j += 1
        doc.from_array(headings, attrs)

    # finally, set the entities
    if ents:
        doc.ents = [
            Span(doc, start, end, label=doc.vocab.strings[label])
            for start, end, label in ents
        ]
    if tags:
        for token in doc:
            token.tag_ = tags[token.i]
    return doc
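For reference, a minimal sketch of how the extended `get_doc` helper can be called, assuming the `en_vocab` fixture used elsewhere in these tests; the words, tags, heads and lemmas below are invented for illustration, and heads are per-token offsets as in the other tests:

```python
# Hypothetical usage of the extended helper; all values are illustrative only.
words = ["Anna", "visited", "London"]
doc = get_doc(
    en_vocab,
    words=words,
    tags=["NNP", "VBD", "NNP"],
    pos=["PROPN", "VERB", "PROPN"],
    heads=[1, 0, -1],  # offsets relative to each token's own index
    deps=["nsubj", "ROOT", "dobj"],
    lemmas=["Anna", "visit", "London"],
)
assert [t.lemma_ for t in doc] == ["Anna", "visit", "London"]
assert [t.head.i for t in doc] == [1, 1, 1]
```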
@@ -14,7 +14,7 @@ import re
from .tokens.doc cimport Doc
from .strings cimport hash_string
from .compat import unescape_unicode
from .compat import unescape_unicode, basestring_
from .attrs import intify_attrs
from .symbols import ORTH
@@ -508,6 +508,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#to_disk
        """
        path = util.ensure_path(path)
        with path.open("wb") as file_:
            file_.write(self.to_bytes(**kwargs))
@@ -521,6 +522,7 @@ cdef class Tokenizer:

        DOCS: https://spacy.io/api/tokenizer#from_disk
        """
        path = util.ensure_path(path)
        with path.open("rb") as file_:
            bytes_data = file_.read()
        self.from_bytes(bytes_data, **kwargs)
@@ -568,22 +570,22 @@
        for key in ["prefix_search", "suffix_search", "infix_finditer"]:
            if key in data:
                data[key] = unescape_unicode(data[key])
        if data.get("prefix_search"):
        if "prefix_search" in data and isinstance(data["prefix_search"], basestring_):
            self.prefix_search = re.compile(data["prefix_search"]).search
        if data.get("suffix_search"):
        if "suffix_search" in data and isinstance(data["suffix_search"], basestring_):
            self.suffix_search = re.compile(data["suffix_search"]).search
        if data.get("infix_finditer"):
        if "infix_finditer" in data and isinstance(data["infix_finditer"], basestring_):
            self.infix_finditer = re.compile(data["infix_finditer"]).finditer
        if data.get("token_match"):
        if "token_match" in data and isinstance(data["token_match"], basestring_):
            self.token_match = re.compile(data["token_match"]).match
        if data.get("rules"):
        if "rules" in data and isinstance(data["rules"], dict):
            # make sure to hard reset the cache to remove data from the default exceptions
            self._rules = {}
            self._reset_cache([key for key in self._cache])
            self._reset_specials()
            self._cache = PreshMap()
            self._specials = PreshMap()
            self._load_special_tokenization(data.get("rules", {}))
            self._load_special_tokenization(data["rules"])

        return self
@@ -213,6 +213,10 @@ def _merge(Doc doc, merges):
        new_orth = ''.join([t.text_with_ws for t in spans[token_index]])
        if spans[token_index][-1].whitespace_:
            new_orth = new_orth[:-len(spans[token_index][-1].whitespace_)]
        # add the vector of the (merged) entity to the vocab
        if not doc.vocab.get_vector(new_orth).any():
            if doc.vocab.vectors_length > 0:
                doc.vocab.set_vector(new_orth, span.vector)
        token = tokens[token_index]
        lex = doc.vocab.get(doc.mem, new_orth)
        token.lex = lex
@@ -785,10 +785,12 @@ cdef class Doc:
        # Allow strings, e.g. 'lemma' or 'LEMMA'
        attrs = [(IDS[id_.upper()] if hasattr(id_, "upper") else id_)
                 for id_ in attrs]
        if array.dtype != numpy.uint64:
            user_warning(Warnings.W028.format(type=array.dtype))

        if SENT_START in attrs and HEAD in attrs:
            raise ValueError(Errors.E032)
        cdef int i, col
        cdef int i, col, abs_head_index
        cdef attr_id_t attr_id
        cdef TokenC* tokens = self.c
        cdef int length = len(array)
@@ -802,6 +804,14 @@
            attr_ids[i] = attr_id
        if len(array.shape) == 1:
            array = array.reshape((array.size, 1))
        # Check that all heads are within the document bounds
        if HEAD in attrs:
            col = attrs.index(HEAD)
            for i in range(length):
                # cast index to signed int
                abs_head_index = numpy.int32(array[i, col]) + i
                if abs_head_index < 0 or abs_head_index >= length:
                    raise ValueError(Errors.E190.format(index=i, value=array[i, col], rel_head_index=numpy.int32(array[i, col])))
        # Do TAG first. This lets subsequent loop override stuff like POS, LEMMA
        if TAG in attrs:
            col = attrs.index(TAG)
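A minimal sketch of what the bounds check above implies for callers of `Doc.from_array`: a `HEAD` offset pointing outside the document should now raise a `ValueError` built from `Errors.E190` instead of silently corrupting the parse. The example array below is invented for illustration:

```python
# Sketch only: feed Doc.from_array a HEAD column whose offset points past the
# end of the document and expect a ValueError from the new bounds check.
import numpy
from spacy.lang.en import English
from spacy.attrs import HEAD, DEP

nlp = English()
doc = nlp("This is a sentence")
dep = doc.vocab.strings.add("dep")
# a head offset of +10 from token 0 lands outside the 4-token document
arr = numpy.array([[10, dep], [0, dep], [0, dep], [0, dep]], dtype="uint64")
try:
    doc.from_array([HEAD, DEP], arr)
except ValueError as err:
    print(err)  # expected to report the out-of-bounds head for token 0
```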
@@ -872,7 +882,7 @@

        DOCS: https://spacy.io/api/doc#to_bytes
        """
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID]  # TODO: ENT_KB_ID ?
        array_head = [LENGTH, SPACY, LEMMA, ENT_IOB, ENT_TYPE, ENT_ID, NORM]  # TODO: ENT_KB_ID ?
        if self.is_tagged:
            array_head.extend([TAG, POS])
        # If doc parsed add head and dep attribute
@@ -1173,6 +1183,7 @@ cdef int set_children_from_heads(TokenC* tokens, int length) except -1:
        heads_within_sents = _set_lr_kids_and_edges(tokens, length, loop_count)
        if loop_count > 10:
            user_warning(Warnings.W026)
            break
        loop_count += 1
    # Set sentence starts
    for i in range(length):
@@ -623,6 +623,9 @@ cdef class Token:
        # This function sets the head of self to new_head and updates the
        # counters for left/right dependents and left/right corner for the
        # new and the old head
        # Check that token is from the same document
        if self.doc != new_head.doc:
            raise ValueError(Errors.E191)
        # Do nothing if old head is new head
        if self.i + self.c.head == new_head.i:
            return
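A minimal sketch of the behaviour the check above introduces: assigning a head token that belongs to a different `Doc` is expected to raise a `ValueError`. The sentences below are toy examples on a blank English pipeline:

```python
# Sketch only: with the document check above, a cross-document head assignment
# is expected to raise instead of being applied.
from spacy.lang.en import English

nlp = English()
doc1 = nlp("I like bananas")
doc2 = nlp("I prefer wasabi")
try:
    doc1[0].head = doc2[1]  # head token belongs to a different Doc
except ValueError as err:
    print(err)
```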
@@ -79,6 +79,14 @@ class Underscore(object):
    def _get_key(self, name):
        return ("._.", name, self._start, self._end)

    @classmethod
    def get_state(cls):
        return cls.token_extensions, cls.span_extensions, cls.doc_extensions

    @classmethod
    def load_state(cls, state):
        cls.token_extensions, cls.span_extensions, cls.doc_extensions = state


def get_ext_args(**kwargs):
    """Validate and convert arguments. Reused in Doc, Token and Span."""
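A minimal sketch of what the new `get_state` / `load_state` helpers provide: a snapshot of the registered custom extensions that can be re-applied elsewhere, for example in a worker process started by `nlp.pipe(..., n_process=...)`. The extension name below is invented:

```python
# Sketch only: capture and restore the extension registry via the new helpers.
from spacy.tokens import Doc
from spacy.tokens.underscore import Underscore

Doc.set_extension("note", default=None, force=True)  # hypothetical extension
state = Underscore.get_state()   # (token_extensions, span_extensions, doc_extensions)
Underscore.load_state(state)     # re-register the same extensions, e.g. in a child process
assert Doc.has_extension("note")
```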
@@ -184,16 +184,17 @@ low data labels and more.
$ python -m spacy debug-data [lang] [train_path] [dev_path] [--base-model] [--pipeline] [--ignore-warnings] [--verbose] [--no-format]
```

| Argument | Type | Description |
| -------------------------- | ---------- | --------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |
| Argument | Type | Description |
| ------------------------------------------------------ | ---------- | --------------------------------------------------------------------------------------------------- |
| `lang` | positional | Model language. |
| `train_path` | positional | Location of JSON-formatted training data. Can be a file or a directory of files. |
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.3</Tag> | option | Location of JSON-formatted tag map. |
| `--base-model`, `-b` | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--ignore-warnings`, `-IW` | flag | Ignore warnings, only show stats and errors. |
| `--verbose`, `-V` | flag | Print additional information and explanations. |
| --no-format, `-NF` | flag | Don't pretty-print the results. Use this if you want to write to a file. |

<Accordion title="Example output">
@@ -368,6 +369,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `dev_path` | positional | Location of JSON-formatted development data for evaluation. Can be a file or a directory of files. |
| `--base-model`, `-b` <Tag variant="new">2.1</Tag> | option | Optional name of base model to update. Can be any loadable spaCy model. |
| `--pipeline`, `-p` <Tag variant="new">2.1</Tag> | option | Comma-separated names of pipeline components to train. Defaults to `'tagger,parser,ner'`. |
| `--replace-components`, `-R` | flag | Replace components from the base model. |
| `--vectors`, `-v` | option | Model to load vectors from. |
| `--n-iter`, `-n` | option | Number of iterations (default: `30`). |
| `--n-early-stopping`, `-ne` | option | Maximum number of training epochs without dev accuracy improvement. |
@@ -378,6 +380,13 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--init-tok2vec`, `-t2v` <Tag variant="new">2.1</Tag> | option | Path to pretrained weights for the token-to-vector parts of the models. See `spacy pretrain`. Experimental. |
| `--parser-multitasks`, `-pt` | option | Side objectives for parser CNN, e.g. `'dep'` or `'dep,tag'` |
| `--entity-multitasks`, `-et` | option | Side objectives for NER CNN, e.g. `'dep'` or `'dep,tag'` |
| `--width`, `-cw` <Tag variant="new">2.2.4</Tag> | option | Width of CNN layers of `Tok2Vec` component. |
| `--conv-depth`, `-cd` <Tag variant="new">2.2.4</Tag> | option | Depth of CNN layers of `Tok2Vec` component. |
| `--cnn-window`, `-cW` <Tag variant="new">2.2.4</Tag> | option | Window size for CNN layers of `Tok2Vec` component. |
| `--cnn-pieces`, `-cP` <Tag variant="new">2.2.4</Tag> | option | Maxout size for CNN layers of `Tok2Vec` component. |
| `--use-chars`, `-chr` <Tag variant="new">2.2.4</Tag> | flag | Whether to use character-based embedding of `Tok2Vec` component. |
| `--bilstm-depth`, `-lstm` <Tag variant="new">2.2.4</Tag> | option | Depth of BiLSTM layers of `Tok2Vec` component (requires PyTorch). |
| `--embed-rows`, `-er` <Tag variant="new">2.2.4</Tag> | option | Number of embedding rows of `Tok2Vec` component. |
| `--noise-level`, `-nl` | option | Float indicating the amount of corruption for data augmentation. |
| `--orth-variant-level`, `-ovl` <Tag variant="new">2.2</Tag> | option | Float indicating the orthography variation for data augmentation (e.g. `0.3` for making 30% of occurrences of some tokens subject to replacement). |
| `--gold-preproc`, `-G` | flag | Use gold preprocessing. |
@@ -385,6 +394,7 @@ $ python -m spacy train [lang] [output_path] [train_path] [dev_path]
| `--textcat-multilabel`, `-TML` <Tag variant="new">2.2</Tag> | flag | Text classification classes aren't mutually exclusive (multilabel). |
| `--textcat-arch`, `-ta` <Tag variant="new">2.2</Tag> | option | Text classification model architecture. Defaults to `"bow"`. |
| `--textcat-positive-label`, `-tpl` <Tag variant="new">2.2</Tag> | option | Text classification positive label for binary classes with two labels. |
| `--tag-map-path`, `-tm` <Tag variant="new">2.2.4</Tag> | option | Location of JSON-formatted tag map. |
| `--verbose`, `-VV` <Tag variant="new">2.0.13</Tag> | flag | Show more detailed messages during training. |
| `--help`, `-h` | flag | Show help message and available arguments. |
| **CREATES** | model, pickle | A spaCy model on each epoch. |
@@ -7,9 +7,10 @@ source: spacy/tokens/doc.pyx

A `Doc` is a sequence of [`Token`](/api/token) objects. Access sentences and
named entities, export annotations to numpy arrays, losslessly serialize to
compressed binary strings. The `Doc` object holds an array of [`TokenC`](/api/cython-structs#tokenc) structs.
The Python-level `Token` and [`Span`](/api/span) objects are views of this
array, i.e. they don't own the data themselves.
compressed binary strings. The `Doc` object holds an array of
[`TokenC`](/api/cython-structs#tokenc) structs. The Python-level `Token` and
[`Span`](/api/span) objects are views of this array, i.e. they don't own the
data themselves.
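A minimal sketch of the point that `Token` and `Span` are views of the `Doc`'s data rather than copies, using a blank English pipeline:

```python
# Sketch only: slicing a Doc returns a Span that refers to the same underlying data.
from spacy.lang.en import English

nlp = English()
doc = nlp("Give it back! He pleaded.")
span = doc[1:4]
assert span.doc is doc             # the span is a view of the same Doc
assert span[0].text == doc[1].text
```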

## Doc.\_\_init\_\_ {#init tag="method"}
@@ -197,13 +198,14 @@ the character indices don't map to a valid span.
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | ------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the Span, e.g. for named entities. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |
| Name | Type | Description |
| ------------------------------------ | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` <Tag variant="new">2.2</Tag> | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Doc.similarity {#similarity tag="method" model="vectors"}
@@ -172,6 +172,28 @@ Remove a previously registered extension.
| `name` | unicode | Name of the extension. |
| **RETURNS** | tuple | A `(default, method, getter, setter)` tuple of the removed extension. |

## Span.char_span {#char_span tag="method" new="2.2.4"}

Create a `Span` object from the slice `span.text[start:end]`. Returns `None` if
the character indices don't map to a valid span.

> #### Example
>
> ```python
> doc = nlp("I like New York")
> span = doc[1:4].char_span(5, 13, label="GPE")
> assert span.text == "New York"
> ```

| Name | Type | Description |
| ----------- | ---------------------------------------- | ---------------------------------------------------------------------- |
| `start` | int | The index of the first character of the span. |
| `end` | int | The index of the last character after the span. |
| `label` | uint64 / unicode | A label to attach to the span, e.g. for named entities. |
| `kb_id` | uint64 / unicode | An ID from a knowledge base to capture the meaning of a named entity. |
| `vector` | `numpy.ndarray[ndim=1, dtype='float32']` | A meaning representation of the span. |
| **RETURNS** | `Span` | The newly constructed object or `None`. |

## Span.similarity {#similarity tag="method" model="vectors"}

Make a semantic similarity estimate. The default estimate is cosine similarity
@@ -293,10 +315,10 @@ Create a new `Doc` object corresponding to the `Span`, with a copy of the data.
> assert doc2.text == "New York"
> ```

| Name | Type | Description |
| ----------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |
| Name | Type | Description |
| ---------------- | ----- | ---------------------------------------------------- |
| `copy_user_data` | bool | Whether or not to copy the original doc's user data. |
| **RETURNS** | `Doc` | A `Doc` object of the `Span`'s content. |

## Span.root {#root tag="property" model="parser"}
@@ -437,8 +437,8 @@ The L2 norm of the token's vector representation.
| `norm_` | unicode | The token's norm, i.e. a normalized form of the token text. Usually set in the language's [tokenizer exceptions](/usage/adding-languages#tokenizer-exceptions) or [norm exceptions](/usage/adding-languages#norm-exceptions). |
| `lower` | int | Lowercase form of the token. |
| `lower_` | unicode | Lowercase form of the token text. Equivalent to `Token.text.lower()`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape` | int | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `shape_` | unicode | Transform of the tokens's string, to show orthographic features. Alphabetic characters are replaced by `x` or `X`, and numeric characters are replaced by `d`, and sequences of the same character are truncated after length 4. For example,`"Xxxx"`or`"dd"`. |
| `prefix` | int | Hash value of a length-N substring from the start of the token. Defaults to `N=1`. |
| `prefix_` | unicode | A length-N substring from the start of the token. Defaults to `N=1`. |
| `suffix` | int | Hash value of a length-N substring from the end of the token. Defaults to `N=3`. |
@@ -236,22 +236,22 @@ If a setting is not present in the options, the default value will be used.
> displacy.serve(doc, style="dep", options=options)
> ```

| Name | Type | Description | Default |
| ------------------ | ------- | ----------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` | bool | Print the lemma's in a separate row below the token texts in the `dep` visualisation. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |
| Name | Type | Description | Default |
| ------------------------------------------ | ------- | ----------------------------------------------------------------------------------------------------------------- | ----------------------- |
| `fine_grained` | bool | Use fine-grained part-of-speech tags (`Token.tag_`) instead of coarse-grained tags (`Token.pos_`). | `False` |
| `add_lemma` <Tag variant="new">2.2.4</Tag> | bool | Print the lemma's in a separate row below the token texts. | `False` |
| `collapse_punct` | bool | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. | `True` |
| `collapse_phrases` | bool | Merge noun phrases into one token. | `False` |
| `compact` | bool | "Compact mode" with square arrows that takes up less space. | `False` |
| `color` | unicode | Text color (HEX, RGB or color names). | `'#000000'` |
| `bg` | unicode | Background color (HEX, RGB or color names). | `'#ffffff'` |
| `font` | unicode | Font name or font family for all text. | `'Arial'` |
| `offset_x` | int | Spacing on left side of the SVG in px. | `50` |
| `arrow_stroke` | int | Width of arrow path in px. | `2` |
| `arrow_width` | int | Width of arrow head in px. | `10` / `8` (compact) |
| `arrow_spacing` | int | Spacing between arrows in px to avoid overlaps. | `20` / `12` (compact) |
| `word_spacing` | int | Vertical spacing between words and arcs in px. | `45` |
| `distance` | int | Distance between words in px. | `175` / `150` (compact) |

#### Named Entity Visualizer options {#displacy_options-ent}
@@ -95,6 +95,8 @@
      "has_examples": true
    },
    { "code": "hr", "name": "Croatian", "has_examples": true },
    { "code": "eu", "name": "Basque", "has_examples": true },
    { "code": "yo", "name": "Yoruba", "has_examples": true },
    { "code": "tr", "name": "Turkish", "example": "Bu bir cümledir.", "has_examples": true },
    { "code": "ca", "name": "Catalan", "example": "Això és una frase.", "has_examples": true },
    { "code": "he", "name": "Hebrew", "example": "זהו משפט.", "has_examples": true },
@@ -1,6 +1,6 @@
<svg xmlns="http://www.w3.org/2000/svg" width="220" height="37" viewBox="0 0 610 103">
  <defs>
    <radialGradient id="gradient_allenai1 "cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
    <radialGradient id="gradient_allenai1" cx="75.721" cy="20.894" r="11.05" gradientUnits="userSpaceOnUse">
      <stop offset=".3" stop-color="#FDEA65" />
      <stop offset="1" stop-color="#FCB431" />
    </radialGradient>
(SVG image: 9.6 KiB before, 9.6 KiB after)