Mirror of https://github.com/explosion/spaCy.git

Commit 02de21d8b4: Merge branch 'master' into spacy.io
106  .github/contributors/GuiGel.md  vendored  Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Guillaume Gelabert |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2019-11-15 |
|
||||
| GitHub username | GuiGel |
|
||||
| Website (optional) | |
|
106  .github/contributors/erip.md  vendored  Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Elijah Rippeth |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2019-11-16 |
|
||||
| GitHub username | erip |
|
||||
| Website (optional) | |
|
|
@ -50,15 +50,16 @@ jobs:
|
|||
Python36Mac:
|
||||
imageName: 'macos-10.13'
|
||||
python.version: '3.6'
|
||||
Python37Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.7'
|
||||
Python37Windows:
|
||||
imageName: 'vs2017-win2016'
|
||||
python.version: '3.7'
|
||||
Python37Mac:
|
||||
imageName: 'macos-10.13'
|
||||
python.version: '3.7'
|
||||
# Don't test on 3.7 for now to speed up builds
|
||||
# Python37Linux:
|
||||
# imageName: 'ubuntu-16.04'
|
||||
# python.version: '3.7'
|
||||
# Python37Windows:
|
||||
# imageName: 'vs2017-win2016'
|
||||
# python.version: '3.7'
|
||||
# Python37Mac:
|
||||
# imageName: 'macos-10.13'
|
||||
# python.version: '3.7'
|
||||
Python38Linux:
|
||||
imageName: 'ubuntu-16.04'
|
||||
python.version: '3.8'
|
||||
|
@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
__version__ = "2.2.2"
__version__ = "2.2.3"
__release__ = True
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

@@ -529,6 +529,7 @@ class Errors(object):
    E185 = ("Received invalid attribute in component attribute declaration: "
            "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
    E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
    E187 = ("Only unicode strings are supported as labels.")


@add_codes
@ -31,6 +31,10 @@ _latin_u_supplement = r"\u00C0-\u00D6\u00D8-\u00DE"
|
|||
_latin_l_supplement = r"\u00DF-\u00F6\u00F8-\u00FF"
|
||||
_latin_supplement = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"
|
||||
|
||||
_hangul_syllables = r"\uAC00-\uD7AF"
|
||||
_hangul_jamo = r"\u1100-\u11FF"
|
||||
_hangul = _hangul_syllables + _hangul_jamo
|
||||
|
||||
# letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
|
||||
_latin_u_extendedA = (
|
||||
r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"
|
||||
|
@ -202,7 +206,15 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
|
|||
_lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower
|
||||
|
||||
_uncased = (
|
||||
_bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu
|
||||
_bengali
|
||||
+ _hebrew
|
||||
+ _persian
|
||||
+ _sinhala
|
||||
+ _hindi
|
||||
+ _kannada
|
||||
+ _tamil
|
||||
+ _telugu
|
||||
+ _hangul
|
||||
)
|
||||
|
||||
ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)
|
||||
|
|
67  spacy/lang/ko/lex_attrs.py  Normal file
|
@ -0,0 +1,67 @@
|
|||
# coding: utf8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = [
|
||||
"영",
|
||||
"공",
|
||||
# Native Korean number system
|
||||
"하나",
|
||||
"둘",
|
||||
"셋",
|
||||
"넷",
|
||||
"다섯",
|
||||
"여섯",
|
||||
"일곱",
|
||||
"여덟",
|
||||
"아홉",
|
||||
"열",
|
||||
"스물",
|
||||
"서른",
|
||||
"마흔",
|
||||
"쉰",
|
||||
"예순",
|
||||
"일흔",
|
||||
"여든",
|
||||
"아흔",
|
||||
# Sino-Korean number system
|
||||
"일",
|
||||
"이",
|
||||
"삼",
|
||||
"사",
|
||||
"오",
|
||||
"육",
|
||||
"칠",
|
||||
"팔",
|
||||
"구",
|
||||
"십",
|
||||
"백",
|
||||
"천",
|
||||
"만",
|
||||
"십만",
|
||||
"백만",
|
||||
"천만",
|
||||
"일억",
|
||||
"십억",
|
||||
"백억",
|
||||
]
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
if any(char.lower() in _num_words for char in text):
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
|
@ -6,9 +6,7 @@ from ...symbols import ORTH, LEMMA, NORM
|
|||
# TODO
|
||||
# treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)
|
||||
|
||||
_exc = {
|
||||
|
||||
}
|
||||
_exc = {}
|
||||
|
||||
# translate / delete what is not necessary
|
||||
for exc_data in [
|
||||
|
|
|
@ -14,6 +14,7 @@ from .tag_map import TAG_MAP
|
|||
def try_jieba_import(use_jieba):
|
||||
try:
|
||||
import jieba
|
||||
|
||||
return jieba
|
||||
except ImportError:
|
||||
if use_jieba:
|
||||
|
@ -34,7 +35,9 @@ class ChineseTokenizer(DummyTokenizer):
|
|||
def __call__(self, text):
|
||||
# use jieba
|
||||
if self.use_jieba:
|
||||
jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
|
||||
jieba_words = list(
|
||||
[x for x in self.jieba_seg.cut(text, cut_all=False) if x]
|
||||
)
|
||||
words = [jieba_words[0]]
|
||||
spaces = [False]
|
||||
for i in range(1, len(jieba_words)):
|
||||
|
|
|
@ -292,13 +292,14 @@ class EntityRuler(object):
|
|||
self.add_patterns(patterns)
|
||||
else:
|
||||
cfg = {}
|
||||
deserializers = {
|
||||
deserializers_patterns = {
|
||||
"patterns": lambda p: self.add_patterns(
|
||||
srsly.read_jsonl(p.with_suffix(".jsonl"))
|
||||
),
|
||||
"cfg": lambda p: cfg.update(srsly.read_json(p)),
|
||||
)}
|
||||
deserializers_cfg = {
|
||||
"cfg": lambda p: cfg.update(srsly.read_json(p))
|
||||
}
|
||||
from_disk(path, deserializers, {})
|
||||
from_disk(path, deserializers_cfg, {})
|
||||
self.overwrite = cfg.get("overwrite", False)
|
||||
self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
|
||||
self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)
|
||||
|
@ -307,6 +308,7 @@ class EntityRuler(object):
|
|||
self.phrase_matcher = PhraseMatcher(
|
||||
self.nlp.vocab, attr=self.phrase_matcher_attr
|
||||
)
|
||||
from_disk(path, deserializers_patterns, {})
|
||||
return self
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
|
|
|
@ -13,6 +13,7 @@ from thinc.misc import LayerNorm
|
|||
from thinc.neural.util import to_categorical
|
||||
from thinc.neural.util import get_array_module
|
||||
|
||||
from ..compat import basestring_
|
||||
from ..tokens.doc cimport Doc
|
||||
from ..syntax.nn_parser cimport Parser
|
||||
from ..syntax.ner cimport BiluoPushDown
|
||||
|
@ -547,6 +548,8 @@ class Tagger(Pipe):
|
|||
return build_tagger_model(n_tags, **cfg)
|
||||
|
||||
def add_label(self, label, values=None):
|
||||
if not isinstance(label, basestring_):
|
||||
raise ValueError(Errors.E187)
|
||||
if label in self.labels:
|
||||
return 0
|
||||
if self.model not in (True, False, None):
|
||||
|
@ -1016,6 +1019,8 @@ class TextCategorizer(Pipe):
|
|||
return float(mean_square_error), d_scores
|
||||
|
||||
def add_label(self, label):
|
||||
if not isinstance(label, basestring_):
|
||||
raise ValueError(Errors.E187)
|
||||
if label in self.labels:
|
||||
return 0
|
||||
if self.model not in (None, True, False):
|
||||
|
|
|
@ -271,7 +271,9 @@ class Scorer(object):
|
|||
self.labelled_per_dep[token.dep_.lower()] = PRFScore()
|
||||
if token.dep_.lower() not in cand_deps_per_dep:
|
||||
cand_deps_per_dep[token.dep_.lower()] = set()
|
||||
cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
|
||||
cand_deps_per_dep[token.dep_.lower()].add(
|
||||
(gold_i, gold_head, token.dep_.lower())
|
||||
)
|
||||
if "-" not in [token[-1] for token in gold.orig_annot]:
|
||||
# Find all NER labels in gold and doc
|
||||
ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])
|
||||
|
@ -304,7 +306,9 @@ class Scorer(object):
|
|||
self.tags.score_set(cand_tags, gold_tags)
|
||||
self.labelled.score_set(cand_deps, gold_deps)
|
||||
for dep in self.labelled_per_dep:
|
||||
self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
|
||||
self.labelled_per_dep[dep].score_set(
|
||||
cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set())
|
||||
)
|
||||
self.unlabelled.score_set(
|
||||
set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
|
||||
)
|
||||
|
|
|
@ -42,11 +42,17 @@ cdef WeightsC get_c_weights(model) except *:
|
|||
cdef precompute_hiddens state2vec = model.state2vec
|
||||
output.feat_weights = state2vec.get_feat_weights()
|
||||
output.feat_bias = <const float*>state2vec.bias.data
|
||||
cdef np.ndarray vec2scores_W = model.vec2scores.W
|
||||
cdef np.ndarray vec2scores_b = model.vec2scores.b
|
||||
cdef np.ndarray vec2scores_W
|
||||
cdef np.ndarray vec2scores_b
|
||||
if model.vec2scores is None:
|
||||
output.hidden_weights = NULL
|
||||
output.hidden_bias = NULL
|
||||
else:
|
||||
vec2scores_W = model.vec2scores.W
|
||||
vec2scores_b = model.vec2scores.b
|
||||
output.hidden_weights = <const float*>vec2scores_W.data
|
||||
output.hidden_bias = <const float*>vec2scores_b.data
|
||||
cdef np.ndarray class_mask = model._class_mask
|
||||
output.hidden_weights = <const float*>vec2scores_W.data
|
||||
output.hidden_bias = <const float*>vec2scores_b.data
|
||||
output.seen_classes = <const float*>class_mask.data
|
||||
return output
|
||||
|
||||
|
@ -54,7 +60,10 @@ cdef WeightsC get_c_weights(model) except *:
|
|||
cdef SizesC get_c_sizes(model, int batch_size) except *:
|
||||
cdef SizesC output
|
||||
output.states = batch_size
|
||||
output.classes = model.vec2scores.nO
|
||||
if model.vec2scores is None:
|
||||
output.classes = model.state2vec.nO
|
||||
else:
|
||||
output.classes = model.vec2scores.nO
|
||||
output.hiddens = model.state2vec.nO
|
||||
output.pieces = model.state2vec.nP
|
||||
output.feats = model.state2vec.nF
|
||||
|
@ -105,11 +114,12 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:
|
|||
|
||||
cdef void predict_states(ActivationsC* A, StateC** states,
|
||||
const WeightsC* W, SizesC n) nogil:
|
||||
cdef double one = 1.0
|
||||
resize_activations(A, n)
|
||||
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
||||
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
|
||||
for i in range(n.states):
|
||||
states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
|
||||
memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
|
||||
memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
|
||||
sum_state_features(A.unmaxed,
|
||||
W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces)
|
||||
for i in range(n.states):
|
||||
|
@ -120,18 +130,20 @@ cdef void predict_states(ActivationsC* A, StateC** states,
|
|||
which = Vec.arg_max(&A.unmaxed[index], n.pieces)
|
||||
A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
|
||||
memset(A.scores, 0, n.states * n.classes * sizeof(float))
|
||||
cdef double one = 1.0
|
||||
# Compute hidden-to-output
|
||||
blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
|
||||
n.states, n.classes, n.hiddens, one,
|
||||
<float*>A.hiddens, n.hiddens, 1,
|
||||
<float*>W.hidden_weights, n.hiddens, 1,
|
||||
one,
|
||||
<float*>A.scores, n.classes, 1)
|
||||
# Add bias
|
||||
for i in range(n.states):
|
||||
VecVec.add_i(&A.scores[i*n.classes],
|
||||
W.hidden_bias, 1., n.classes)
|
||||
if W.hidden_weights == NULL:
|
||||
memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
|
||||
else:
|
||||
# Compute hidden-to-output
|
||||
blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
|
||||
n.states, n.classes, n.hiddens, one,
|
||||
<float*>A.hiddens, n.hiddens, 1,
|
||||
<float*>W.hidden_weights, n.hiddens, 1,
|
||||
one,
|
||||
<float*>A.scores, n.classes, 1)
|
||||
# Add bias
|
||||
for i in range(n.states):
|
||||
VecVec.add_i(&A.scores[i*n.classes],
|
||||
W.hidden_bias, 1., n.classes)
|
||||
# Set unseen classes to minimum value
|
||||
i = 0
|
||||
min_ = A.scores[0]
|
||||
|
@ -219,7 +231,9 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no
|
|||
class ParserModel(Model):
|
||||
def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None):
|
||||
Model.__init__(self)
|
||||
self._layers = [tok2vec, lower_model, upper_model]
|
||||
self._layers = [tok2vec, lower_model]
|
||||
if upper_model is not None:
|
||||
self._layers.append(upper_model)
|
||||
self.unseen_classes = set()
|
||||
if unseen_classes:
|
||||
for class_ in unseen_classes:
|
||||
|
@ -234,6 +248,8 @@ class ParserModel(Model):
|
|||
return step_model, finish_parser_update
|
||||
|
||||
def resize_output(self, new_output):
|
||||
if len(self._layers) == 2:
|
||||
return
|
||||
if new_output == self.upper.nO:
|
||||
return
|
||||
smaller = self.upper
|
||||
|
@ -275,12 +291,24 @@ class ParserModel(Model):
|
|||
class ParserStepModel(Model):
|
||||
def __init__(self, docs, layers, unseen_classes=None, drop=0.):
|
||||
self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop)
|
||||
if layers[1].nP >= 2:
|
||||
activation = "maxout"
|
||||
elif len(layers) == 2:
|
||||
activation = None
|
||||
else:
|
||||
activation = "relu"
|
||||
self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
|
||||
drop=drop)
|
||||
self.vec2scores = layers[-1]
|
||||
self.cuda_stream = util.get_cuda_stream()
|
||||
activation=activation, drop=drop)
|
||||
if len(layers) == 3:
|
||||
self.vec2scores = layers[-1]
|
||||
else:
|
||||
self.vec2scores = None
|
||||
self.cuda_stream = util.get_cuda_stream(non_blocking=True)
|
||||
self.backprops = []
|
||||
self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
|
||||
if self.vec2scores is None:
|
||||
self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f')
|
||||
else:
|
||||
self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
|
||||
self._class_mask.fill(1)
|
||||
if unseen_classes is not None:
|
||||
for class_ in unseen_classes:
|
||||
|
@ -302,10 +330,15 @@ class ParserStepModel(Model):
|
|||
def begin_update(self, states, drop=0.):
|
||||
token_ids = self.get_token_ids(states)
|
||||
vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0)
|
||||
mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
|
||||
if mask is not None:
|
||||
vector *= mask
|
||||
scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
|
||||
if self.vec2scores is not None:
|
||||
mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
|
||||
if mask is not None:
|
||||
vector *= mask
|
||||
scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
|
||||
else:
|
||||
scores = NumpyOps().asarray(vector)
|
||||
get_d_vector = lambda d_scores, sgd=None: d_scores
|
||||
mask = None
|
||||
# If the class is unseen, make sure its score is minimum
|
||||
scores[:, self._class_mask == 0] = numpy.nanmin(scores)
|
||||
|
||||
|
@ -342,12 +375,12 @@ class ParserStepModel(Model):
|
|||
return ids
|
||||
|
||||
def make_updates(self, sgd):
|
||||
# Tells CUDA to block, so our async copies complete.
|
||||
if self.cuda_stream is not None:
|
||||
self.cuda_stream.synchronize()
|
||||
# Add a padding vector to the d_tokvecs gradient, so that missing
|
||||
# values don't affect the real gradient.
|
||||
d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
|
||||
# Tells CUDA to block, so our async copies complete.
|
||||
if self.cuda_stream is not None:
|
||||
self.cuda_stream.synchronize()
|
||||
for ids, d_vector, bp_vector in self.backprops:
|
||||
d_state_features = bp_vector((d_vector, ids), sgd=sgd)
|
||||
ids = ids.flatten()
|
||||
|
@ -385,9 +418,10 @@ cdef class precompute_hiddens:
|
|||
cdef np.ndarray bias
|
||||
cdef object _cuda_stream
|
||||
cdef object _bp_hiddens
|
||||
cdef object activation
|
||||
|
||||
def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
|
||||
drop=0.):
|
||||
activation="maxout", drop=0.):
|
||||
gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop)
|
||||
cdef np.ndarray cached
|
||||
if not isinstance(gpu_cached, numpy.ndarray):
|
||||
|
@ -405,6 +439,8 @@ cdef class precompute_hiddens:
|
|||
self.nP = getattr(lower_model, 'nP', 1)
|
||||
self.nO = cached.shape[2]
|
||||
self.ops = lower_model.ops
|
||||
assert activation in (None, "relu", "maxout")
|
||||
self.activation = activation
|
||||
self._is_synchronized = False
|
||||
self._cuda_stream = cuda_stream
|
||||
self._cached = cached
|
||||
|
@ -417,7 +453,7 @@ cdef class precompute_hiddens:
|
|||
return <float*>self._cached.data
|
||||
|
||||
def __call__(self, X):
|
||||
return self.begin_update(X)[0]
|
||||
return self.begin_update(X, drop=None)[0]
|
||||
|
||||
def begin_update(self, token_ids, drop=0.):
|
||||
cdef np.ndarray state_vector = numpy.zeros(
|
||||
|
@ -450,28 +486,35 @@ cdef class precompute_hiddens:
|
|||
else:
|
||||
ops = CupyOps()
|
||||
|
||||
if self.nP == 1:
|
||||
state_vector = state_vector.reshape(state_vector.shape[:-1])
|
||||
mask = state_vector >= 0.
|
||||
state_vector *= mask
|
||||
else:
|
||||
if self.activation == "maxout":
|
||||
state_vector, mask = ops.maxout(state_vector)
|
||||
else:
|
||||
state_vector = state_vector.reshape(state_vector.shape[:-1])
|
||||
if self.activation == "relu":
|
||||
mask = state_vector >= 0.
|
||||
state_vector *= mask
|
||||
else:
|
||||
mask = None
|
||||
|
||||
def backprop_nonlinearity(d_best, sgd=None):
|
||||
if isinstance(d_best, numpy.ndarray):
|
||||
ops = NumpyOps()
|
||||
else:
|
||||
ops = CupyOps()
|
||||
mask_ = ops.asarray(mask)
|
||||
|
||||
if mask is not None:
|
||||
mask_ = ops.asarray(mask)
|
||||
# This will usually be on GPU
|
||||
d_best = ops.asarray(d_best)
|
||||
# Fix nans (which can occur from unseen classes.)
|
||||
d_best[ops.xp.isnan(d_best)] = 0.
|
||||
if self.nP == 1:
|
||||
if self.activation == "maxout":
|
||||
mask_ = ops.asarray(mask)
|
||||
return ops.backprop_maxout(d_best, mask_, self.nP)
|
||||
elif self.activation == "relu":
|
||||
mask_ = ops.asarray(mask)
|
||||
d_best *= mask_
|
||||
d_best = d_best.reshape((d_best.shape + (1,)))
|
||||
return d_best
|
||||
else:
|
||||
return ops.backprop_maxout(d_best, mask_, self.nP)
|
||||
return d_best.reshape((d_best.shape + (1,)))
|
||||
return state_vector, backprop_nonlinearity
|
||||
|
|
|
@ -100,10 +100,30 @@ cdef cppclass StateC:
|
|||
free(this.shifted - PADDING)
|
||||
|
||||
void set_context_tokens(int* ids, int n) nogil:
|
||||
if n == 2:
|
||||
if n == 1:
|
||||
if this.B(0) >= 0:
|
||||
ids[0] = this.B(0)
|
||||
else:
|
||||
ids[0] = -1
|
||||
elif n == 2:
|
||||
ids[0] = this.B(0)
|
||||
ids[1] = this.S(0)
|
||||
if n == 8:
|
||||
elif n == 3:
|
||||
if this.B(0) >= 0:
|
||||
ids[0] = this.B(0)
|
||||
else:
|
||||
ids[0] = -1
|
||||
# First word of entity, if any
|
||||
if this.entity_is_open():
|
||||
ids[1] = this.E(0)
|
||||
else:
|
||||
ids[1] = -1
|
||||
# Last word of entity, if within entity
|
||||
if ids[0] == -1 or ids[1] == -1:
|
||||
ids[2] = -1
|
||||
else:
|
||||
ids[2] = ids[0] - 1
|
||||
elif n == 8:
|
||||
ids[0] = this.B(0)
|
||||
ids[1] = this.B(1)
|
||||
ids[2] = this.S(0)
|
||||
|
|
|
@ -22,7 +22,7 @@ from thinc.extra.search cimport Beam
|
|||
from thinc.api import chain, clone
|
||||
from thinc.v2v import Model, Maxout, Affine
|
||||
from thinc.misc import LayerNorm
|
||||
from thinc.neural.ops import CupyOps
|
||||
from thinc.neural.ops import NumpyOps, CupyOps
|
||||
from thinc.neural.util import get_array_module
|
||||
from thinc.linalg cimport Vec, VecVec
|
||||
import srsly
|
||||
|
@ -61,13 +61,17 @@ cdef class Parser:
|
|||
t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
|
||||
bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
|
||||
self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
|
||||
if depth != 1:
|
||||
nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature)
|
||||
if depth not in (0, 1):
|
||||
raise ValueError(TempErrors.T004.format(value=depth))
|
||||
parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
|
||||
cfg.get('maxout_pieces', 2))
|
||||
token_vector_width = util.env_opt('token_vector_width',
|
||||
cfg.get('token_vector_width', 96))
|
||||
hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64))
|
||||
if depth == 0:
|
||||
hidden_width = nr_class
|
||||
parser_maxout_pieces = 1
|
||||
embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000))
|
||||
pretrained_vectors = cfg.get('pretrained_vectors', None)
|
||||
tok2vec = Tok2Vec(token_vector_width, embed_size,
|
||||
|
@ -80,16 +84,19 @@ cdef class Parser:
|
|||
tok2vec = chain(tok2vec, flatten)
|
||||
tok2vec.nO = token_vector_width
|
||||
lower = PrecomputableAffine(hidden_width,
|
||||
nF=cls.nr_feature, nI=token_vector_width,
|
||||
nF=nr_feature_tokens, nI=token_vector_width,
|
||||
nP=parser_maxout_pieces)
|
||||
lower.nP = parser_maxout_pieces
|
||||
|
||||
with Model.use_device('cpu'):
|
||||
upper = Affine(nr_class, hidden_width, drop_factor=0.0)
|
||||
upper.W *= 0
|
||||
if depth == 1:
|
||||
with Model.use_device('cpu'):
|
||||
upper = Affine(nr_class, hidden_width, drop_factor=0.0)
|
||||
upper.W *= 0
|
||||
else:
|
||||
upper = None
|
||||
|
||||
cfg = {
|
||||
'nr_class': nr_class,
|
||||
'nr_feature_tokens': nr_feature_tokens,
|
||||
'hidden_depth': depth,
|
||||
'token_vector_width': token_vector_width,
|
||||
'hidden_width': hidden_width,
|
||||
|
@ -133,6 +140,7 @@ cdef class Parser:
|
|||
if 'beam_update_prob' not in cfg:
|
||||
cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0)
|
||||
cfg.setdefault('cnn_maxout_pieces', 3)
|
||||
cfg.setdefault("nr_feature_tokens", self.nr_feature)
|
||||
self.cfg = cfg
|
||||
self.model = model
|
||||
self._multitasks = []
|
||||
|
@ -299,7 +307,7 @@ cdef class Parser:
|
|||
token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
|
||||
dtype='i', order='C')
|
||||
cdef int* c_ids
|
||||
cdef int nr_feature = self.nr_feature
|
||||
cdef int nr_feature = self.cfg["nr_feature_tokens"]
|
||||
cdef int n_states
|
||||
model = self.model(docs)
|
||||
todo = [beam for beam in beams if not beam.is_done]
|
||||
|
@ -502,7 +510,7 @@ cdef class Parser:
|
|||
self.moves.preprocess_gold(gold)
|
||||
model, finish_update = self.model.begin_update(docs, drop=drop)
|
||||
states_d_scores, backprops, beams = _beam_utils.update_beam(
|
||||
self.moves, self.nr_feature, 10000, states, golds, model.state2vec,
|
||||
self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec,
|
||||
model.vec2scores, width, drop=drop, losses=losses,
|
||||
beam_density=beam_density)
|
||||
for i, d_scores in enumerate(states_d_scores):
|
||||
|
|
|
@ -2,6 +2,7 @@
|
|||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
import re
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokenizer import Tokenizer
|
||||
from spacy.util import compile_prefix_regex, compile_suffix_regex
|
||||
|
@ -19,13 +20,14 @@ def custom_en_tokenizer(en_vocab):
|
|||
r"[\[\]!&:,()\*—–\/-]",
|
||||
]
|
||||
infix_re = compile_infix_regex(custom_infixes)
|
||||
token_match_re = re.compile("a-b")
|
||||
return Tokenizer(
|
||||
en_vocab,
|
||||
English.Defaults.tokenizer_exceptions,
|
||||
prefix_re.search,
|
||||
suffix_re.search,
|
||||
infix_re.finditer,
|
||||
token_match=None,
|
||||
token_match=token_match_re.match,
|
||||
)
|
||||
|
||||
|
||||
|
@ -74,3 +76,81 @@ def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
|
|||
"Megaregion",
|
||||
".",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_token_match(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions a-b not used for the greater Southern California Megaregion."
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"a-b",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_rules(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"are",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
":)",
|
||||
]
|
||||
|
||||
|
||||
def test_en_customized_tokenizer_handles_rules_property(custom_en_tokenizer):
|
||||
sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
|
||||
rules = custom_en_tokenizer.rules
|
||||
del rules[":)"]
|
||||
custom_en_tokenizer.rules = rules
|
||||
context = [word.text for word in custom_en_tokenizer(sentence)]
|
||||
assert context == [
|
||||
"The",
|
||||
"8",
|
||||
"and",
|
||||
"10",
|
||||
"-",
|
||||
"county",
|
||||
"definitions",
|
||||
"are",
|
||||
"not",
|
||||
"used",
|
||||
"for",
|
||||
"the",
|
||||
"greater",
|
||||
"Southern",
|
||||
"California",
|
||||
"Megaregion",
|
||||
".",
|
||||
":",
|
||||
")",
|
||||
]
|
||||
|
|
|
@ -259,6 +259,27 @@ def test_block_ner():
|
|||
assert [token.ent_type_ for token in doc] == expected_types
|
||||
|
||||
|
||||
def test_change_number_features():
|
||||
# Test the default number features
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
ner.add_label("PERSON")
|
||||
nlp.begin_training()
|
||||
assert ner.model.lower.nF == ner.nr_feature
|
||||
# Test we can change it
|
||||
nlp = English()
|
||||
ner = nlp.create_pipe("ner")
|
||||
nlp.add_pipe(ner)
|
||||
ner.add_label("PERSON")
|
||||
nlp.begin_training(
|
||||
component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}}
|
||||
)
|
||||
assert ner.model.lower.nF == 3
|
||||
# Test the model runs
|
||||
nlp("hello world")
|
||||
|
||||
|
||||
class BlockerComponent1(object):
|
||||
name = "my_blocker"
|
||||
|
||||
|
|
14  spacy/tests/pipeline/test_tagger.py  Normal file
@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.language import Language
from spacy.pipeline import Tagger


def test_label_types():
    nlp = Language()
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("tagger").add_label("A")
    with pytest.raises(ValueError):
        nlp.get_pipe("tagger").add_label(9)
@ -62,3 +62,11 @@ def test_textcat_learns_multilabel():
|
|||
assert score < 0.5
|
||||
else:
|
||||
assert score > 0.5
|
||||
|
||||
|
||||
def test_label_types():
|
||||
nlp = Language()
|
||||
nlp.add_pipe(nlp.create_pipe("textcat"))
|
||||
nlp.get_pipe("textcat").add_label("answer")
|
||||
with pytest.raises(ValueError):
|
||||
nlp.get_pipe("textcat").add_label(9)
|
||||
|
|
|
@ -3,9 +3,9 @@ from __future__ import unicode_literals
|
|||
|
||||
import srsly
|
||||
from spacy.gold import GoldCorpus
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.tests.util import make_tempdir
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue4402():
|
||||
|
|
|
@ -1,7 +1,6 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from mock import Mock
|
||||
from spacy.matcher import DependencyMatcher
|
||||
from ..util import get_doc
|
||||
|
@ -11,8 +10,14 @@ def test_issue4590(en_vocab):
|
|||
"""Test that matches param in on_match method are the same as matches run with no on_match method"""
|
||||
pattern = [
|
||||
{"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
|
||||
{"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||
{"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
|
||||
{
|
||||
"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
|
||||
"PATTERN": {"ORTH": "fox"},
|
||||
},
|
||||
{
|
||||
"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
|
||||
"PATTERN": {"ORTH": "fox"},
|
||||
},
|
||||
]
|
||||
|
||||
on_match = Mock()
|
||||
|
@ -23,12 +28,11 @@ def test_issue4590(en_vocab):
|
|||
text = "The quick brown fox jumped over the lazy fox"
|
||||
heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
|
||||
deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]
|
||||
|
||||
|
||||
doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)
|
||||
|
||||
|
||||
matches = matcher(doc)
|
||||
|
||||
|
||||
on_match_args = on_match.call_args
|
||||
|
||||
assert on_match_args[0][3] == matches
|
||||
|
||||
|
|
65  spacy/tests/regression/test_issue4651.py  Normal file
|
@ -0,0 +1,65 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.pipeline import EntityRuler
|
||||
|
||||
from ..util import make_tempdir
|
||||
|
||||
|
||||
def test_issue4651_with_phrase_matcher_attr():
|
||||
"""Test that the EntityRuler PhraseMatcher is deserialize correctly using
|
||||
the method from_disk when the EntityRuler argument phrase_matcher_attr is
|
||||
specified.
|
||||
"""
|
||||
text = "Spacy is a python library for nlp"
|
||||
|
||||
nlp = English()
|
||||
ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
|
||||
patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc = nlp(text)
|
||||
res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
|
||||
|
||||
nlp_reloaded = English()
|
||||
with make_tempdir() as d:
|
||||
file_path = d / "entityruler"
|
||||
ruler.to_disk(file_path)
|
||||
ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)
|
||||
|
||||
nlp_reloaded.add_pipe(ruler_reloaded)
|
||||
doc_reloaded = nlp_reloaded(text)
|
||||
res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]
|
||||
|
||||
assert res == res_reloaded
|
||||
|
||||
|
||||
def test_issue4651_without_phrase_matcher_attr():
|
||||
"""Test that the EntityRuler PhraseMatcher is deserialize correctly using
|
||||
the method from_disk when the EntityRuler argument phrase_matcher_attr is
|
||||
not specified.
|
||||
"""
|
||||
text = "Spacy is a python library for nlp"
|
||||
|
||||
nlp = English()
|
||||
ruler = EntityRuler(nlp)
|
||||
patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
|
||||
ruler.add_patterns(patterns)
|
||||
nlp.add_pipe(ruler)
|
||||
|
||||
doc = nlp(text)
|
||||
res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
|
||||
|
||||
nlp_reloaded = English()
|
||||
with make_tempdir() as d:
|
||||
file_path = d / "entityruler"
|
||||
ruler.to_disk(file_path)
|
||||
ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)
|
||||
|
||||
nlp_reloaded.add_pipe(ruler_reloaded)
|
||||
doc_reloaded = nlp_reloaded(text)
|
||||
res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]
|
||||
|
||||
assert res == res_reloaded
|
|
@ -12,8 +12,22 @@ from .util import get_doc
|
|||
test_las_apple = [
|
||||
[
|
||||
"Apple is looking at buying U.K. startup for $ 1 billion",
|
||||
{"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
|
||||
"deps": ['nsubj', 'aux', 'ROOT', 'prep', 'pcomp', 'compound', 'dobj', 'prep', 'quantmod', 'compound', 'pobj']},
|
||||
{
|
||||
"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
|
||||
"deps": [
|
||||
"nsubj",
|
||||
"aux",
|
||||
"ROOT",
|
||||
"prep",
|
||||
"pcomp",
|
||||
"compound",
|
||||
"dobj",
|
||||
"prep",
|
||||
"quantmod",
|
||||
"compound",
|
||||
"pobj",
|
||||
],
|
||||
},
|
||||
]
|
||||
]
|
||||
|
||||
|
@ -59,7 +73,7 @@ def test_las_per_type(en_vocab):
|
|||
en_vocab,
|
||||
words=input_.split(" "),
|
||||
heads=([h - i for i, h in enumerate(annot["heads"])]),
|
||||
deps=annot["deps"]
|
||||
deps=annot["deps"],
|
||||
)
|
||||
gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"])
|
||||
doc[0].dep_ = "compound"
|
||||
|
|
65  spacy/tests/tokenizer/test_explain.py  Normal file
|
@ -0,0 +1,65 @@
|
|||
# coding: utf-8
|
||||
from __future__ import unicode_literals
|
||||
|
||||
import pytest
|
||||
from spacy.util import get_lang_class
|
||||
|
||||
# Only include languages with no external dependencies
|
||||
# "is" seems to confuse importlib, so we're also excluding it for now
|
||||
# excluded: ja, ru, th, uk, vi, zh, is
|
||||
LANGUAGES = [
|
||||
pytest.param("fr", marks=pytest.mark.slow()),
|
||||
pytest.param("af", marks=pytest.mark.slow()),
|
||||
pytest.param("ar", marks=pytest.mark.slow()),
|
||||
pytest.param("bg", marks=pytest.mark.slow()),
|
||||
"bn",
|
||||
pytest.param("ca", marks=pytest.mark.slow()),
|
||||
pytest.param("cs", marks=pytest.mark.slow()),
|
||||
pytest.param("da", marks=pytest.mark.slow()),
|
||||
pytest.param("de", marks=pytest.mark.slow()),
|
||||
"el",
|
||||
"en",
|
||||
pytest.param("es", marks=pytest.mark.slow()),
|
||||
pytest.param("et", marks=pytest.mark.slow()),
|
||||
pytest.param("fa", marks=pytest.mark.slow()),
|
||||
pytest.param("fi", marks=pytest.mark.slow()),
|
||||
"fr",
|
||||
pytest.param("ga", marks=pytest.mark.slow()),
|
||||
pytest.param("he", marks=pytest.mark.slow()),
|
||||
pytest.param("hi", marks=pytest.mark.slow()),
|
||||
pytest.param("hr", marks=pytest.mark.slow()),
|
||||
"hu",
|
||||
pytest.param("id", marks=pytest.mark.slow()),
|
||||
pytest.param("it", marks=pytest.mark.slow()),
|
||||
pytest.param("kn", marks=pytest.mark.slow()),
|
||||
pytest.param("lb", marks=pytest.mark.slow()),
|
||||
pytest.param("lt", marks=pytest.mark.slow()),
|
||||
pytest.param("lv", marks=pytest.mark.slow()),
|
||||
pytest.param("nb", marks=pytest.mark.slow()),
|
||||
pytest.param("nl", marks=pytest.mark.slow()),
|
||||
"pl",
|
||||
pytest.param("pt", marks=pytest.mark.slow()),
|
||||
pytest.param("ro", marks=pytest.mark.slow()),
|
||||
pytest.param("si", marks=pytest.mark.slow()),
|
||||
pytest.param("sk", marks=pytest.mark.slow()),
|
||||
pytest.param("sl", marks=pytest.mark.slow()),
|
||||
pytest.param("sq", marks=pytest.mark.slow()),
|
||||
pytest.param("sr", marks=pytest.mark.slow()),
|
||||
pytest.param("sv", marks=pytest.mark.slow()),
|
||||
pytest.param("ta", marks=pytest.mark.slow()),
|
||||
pytest.param("te", marks=pytest.mark.slow()),
|
||||
pytest.param("tl", marks=pytest.mark.slow()),
|
||||
pytest.param("tr", marks=pytest.mark.slow()),
|
||||
pytest.param("tt", marks=pytest.mark.slow()),
|
||||
pytest.param("ur", marks=pytest.mark.slow()),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("lang", LANGUAGES)
|
||||
def test_tokenizer_explain(lang):
|
||||
tokenizer = get_lang_class(lang).Defaults.create_tokenizer()
|
||||
examples = pytest.importorskip("spacy.lang.{}.examples".format(lang))
|
||||
for sentence in examples.sentences:
|
||||
tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
|
||||
debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
|
||||
assert tokens == debug_tokens
|
|
@ -15,6 +15,8 @@ import re
|
|||
from .tokens.doc cimport Doc
|
||||
from .strings cimport hash_string
|
||||
from .compat import unescape_unicode
|
||||
from .attrs import intify_attrs
|
||||
from .symbols import ORTH
|
||||
|
||||
from .errors import Errors, Warnings, deprecation_warning
|
||||
from . import util
|
||||
|
@ -57,9 +59,7 @@ cdef class Tokenizer:
|
|||
self.infix_finditer = infix_finditer
|
||||
self.vocab = vocab
|
||||
self._rules = {}
|
||||
if rules is not None:
|
||||
for chunk, substrings in sorted(rules.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
self._load_special_tokenization(rules)
|
||||
|
||||
property token_match:
|
||||
def __get__(self):
|
||||
|
@ -93,6 +93,18 @@ cdef class Tokenizer:
|
|||
self._infix_finditer = infix_finditer
|
||||
self._flush_cache()
|
||||
|
||||
property rules:
|
||||
def __get__(self):
|
||||
return self._rules
|
||||
|
||||
def __set__(self, rules):
|
||||
self._rules = {}
|
||||
self._reset_cache([key for key in self._cache])
|
||||
self._reset_specials()
|
||||
self._cache = PreshMap()
|
||||
self._specials = PreshMap()
|
||||
self._load_special_tokenization(rules)
|
||||
|
||||
def __reduce__(self):
|
||||
args = (self.vocab,
|
||||
self._rules,
|
||||
|
@ -227,10 +239,6 @@ cdef class Tokenizer:
|
|||
cdef unicode minus_suf
|
||||
cdef size_t last_size = 0
|
||||
while string and len(string) != last_size:
|
||||
if self.token_match and self.token_match(string) \
|
||||
and not self.find_prefix(string) \
|
||||
and not self.find_suffix(string):
|
||||
break
|
||||
if self._specials.get(hash_string(string)) != NULL:
|
||||
has_special[0] = 1
|
||||
break
|
||||
|
@ -393,8 +401,9 @@ cdef class Tokenizer:
|
|||
|
||||
def _load_special_tokenization(self, special_cases):
|
||||
"""Add special-case tokenization rules."""
|
||||
for chunk, substrings in sorted(special_cases.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
if special_cases is not None:
|
||||
for chunk, substrings in sorted(special_cases.items()):
|
||||
self.add_special_case(chunk, substrings)
|
||||
|
||||
def add_special_case(self, unicode string, substrings):
|
||||
"""Add a special-case tokenization rule.
|
||||
|
@ -423,6 +432,73 @@ cdef class Tokenizer:
|
|||
self.mem.free(stale_cached)
|
||||
self._rules[string] = substrings
|
||||
|
||||
def explain(self, text):
|
||||
"""A debugging tokenizer that provides information about which
|
||||
tokenizer rule or pattern was matched for each token. The tokens
|
||||
produced are identical to `nlp.tokenizer()` except for whitespace
|
||||
tokens.
|
||||
|
||||
string (unicode): The string to tokenize.
|
||||
RETURNS (list): A list of (pattern_string, token_string) tuples
|
||||
|
||||
DOCS: https://spacy.io/api/tokenizer#explain
|
||||
"""
|
||||
prefix_search = self.prefix_search
|
||||
suffix_search = self.suffix_search
|
||||
infix_finditer = self.infix_finditer
|
||||
token_match = self.token_match
|
||||
special_cases = {}
|
||||
for orth, special_tokens in self.rules.items():
|
||||
special_cases[orth] = [intify_attrs(special_token, strings_map=self.vocab.strings, _do_deprecated=True) for special_token in special_tokens]
|
||||
tokens = []
|
||||
for substring in text.split():
|
||||
suffixes = []
|
||||
while substring:
|
||||
while prefix_search(substring) or suffix_search(substring):
|
||||
if substring in special_cases:
|
||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
substring = ''
|
||||
break
|
||||
if prefix_search(substring):
|
||||
split = prefix_search(substring).end()
|
||||
# break if pattern matches the empty string
|
||||
if split == 0:
|
||||
break
|
||||
tokens.append(("PREFIX", substring[:split]))
|
||||
substring = substring[split:]
|
||||
if substring in special_cases:
|
||||
continue
|
||||
if suffix_search(substring):
|
||||
split = suffix_search(substring).start()
|
||||
# break if pattern matches the empty string
|
||||
if split == len(substring):
|
||||
break
|
||||
suffixes.append(("SUFFIX", substring[split:]))
|
||||
substring = substring[:split]
|
||||
if substring in special_cases:
|
||||
tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
|
||||
substring = ''
|
||||
elif token_match(substring):
|
||||
tokens.append(("TOKEN_MATCH", substring))
|
||||
substring = ''
|
||||
elif list(infix_finditer(substring)):
|
||||
infixes = infix_finditer(substring)
|
||||
offset = 0
|
||||
for match in infixes:
|
||||
if substring[offset : match.start()]:
|
||||
tokens.append(("TOKEN", substring[offset : match.start()]))
|
||||
if substring[match.start() : match.end()]:
|
||||
tokens.append(("INFIX", substring[match.start() : match.end()]))
|
||||
offset = match.end()
|
||||
if substring[offset:]:
|
||||
tokens.append(("TOKEN", substring[offset:]))
|
||||
substring = ''
|
||||
elif substring:
|
||||
tokens.append(("TOKEN", substring))
|
||||
substring = ''
|
||||
tokens.extend(reversed(suffixes))
|
||||
return tokens
|
||||
|
||||
def to_disk(self, path, **kwargs):
|
||||
"""Save the current state to a directory.
|
||||
|
||||
|
@ -507,8 +583,7 @@ cdef class Tokenizer:
|
|||
self._reset_specials()
|
||||
self._cache = PreshMap()
|
||||
self._specials = PreshMap()
|
||||
for string, substrings in data.get("rules", {}).items():
|
||||
self.add_special_case(string, substrings)
|
||||
self._load_special_tokenization(data.get("rules", {}))
|
||||
|
||||
return self
|
||||
|
||||
|
|
|
@ -301,13 +301,13 @@ def get_component_name(component):
|
|||
return repr(component)
|
||||
|
||||
|
||||
def get_cuda_stream(require=False):
|
||||
def get_cuda_stream(require=False, non_blocking=True):
|
||||
if CudaStream is None:
|
||||
return None
|
||||
elif isinstance(Model.ops, NumpyOps):
|
||||
return None
|
||||
else:
|
||||
return CudaStream()
|
||||
return CudaStream(non_blocking=non_blocking)
|
||||
|
||||
|
||||
def get_async(stream, numpy_array):
|
||||
|
|
|
@ -265,17 +265,12 @@ cdef class Vectors:
|
|||
rows = [self.key2row.get(key, -1.) for key in keys]
|
||||
return xp.asarray(rows, dtype="i")
|
||||
else:
|
||||
targets = set()
|
||||
row2key = {row: key for key, row in self.key2row.items()}
|
||||
if row is not None:
|
||||
targets.add(row)
|
||||
return row2key[row]
|
||||
else:
|
||||
targets.update(rows)
|
||||
results = []
|
||||
for key, row in self.key2row.items():
|
||||
if row in targets:
|
||||
results.append(key)
|
||||
targets.remove(row)
|
||||
return xp.asarray(results, dtype="uint64")
|
||||
results = [row2key[row] for row in rows]
|
||||
return xp.asarray(results, dtype="uint64")
|
||||
|
||||
def add(self, key, *, vector=None, row=None):
|
||||
"""Add a key to the table. Keys can be mapped to an existing vector
|
||||
|
|
|
@ -58,4 +58,5 @@ Update the evaluation scores from a single [`Doc`](/api/doc) /
|
|||
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
|
||||
| `textcat_score` <Tag variant="new">2.2</Tag> | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). |
|
||||
| `textcats_per_cat` <Tag variant="new">2.2</Tag> | dict | Scores per textcat label, keyed by label. |
|
||||
| `las_per_type` <Tag variant="new">2.2.3</Tag> | dict | Labelled dependency scores, keyed by label. |
|
||||
| `scores` | dict | All scores, keyed by type. |
|
||||
|
|
|
@ -34,15 +34,15 @@ the
|
|||
> tokenizer = nlp.Defaults.create_tokenizer(nlp)
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ----------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match` | callable | A boolean function matching strings to be recognized as tokens. |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ----------- | ----------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | A storage container for lexical types. |
|
||||
| `rules` | dict | Exceptions and special-cases for the tokenizer. |
|
||||
| `prefix_search` | callable | A function matching the signature of `re.compile(string).search` to match prefixes. |
|
||||
| `suffix_search` | callable | A function matching the signature of `re.compile(string).search` to match suffixes. |
|
||||
| `infix_finditer` | callable | A function matching the signature of `re.compile(string).finditer` to find infixes. |
|
||||
| `token_match`   | callable    | A function matching the signature of `re.compile(string).match` to find token matches.                                        |
|
||||
| **RETURNS** | `Tokenizer` | The newly constructed object. |
|
||||
|
||||
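As a rough illustration of the updated `token_match` description, a compiled pattern's `.match` method can be passed in directly. This is a minimal sketch mirroring the customized-tokenizer test added in this commit; the `a-b` pattern and the sample text are purely illustrative:

```python
import re
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = English()
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)
infix_re = compile_infix_regex(nlp.Defaults.infixes)
token_match_re = re.compile("a-b")  # strings matching this stay single tokens

tokenizer = Tokenizer(
    nlp.vocab,
    English.Defaults.tokenizer_exceptions,
    prefix_re.search,
    suffix_re.search,
    infix_re.finditer,
    token_match=token_match_re.match,
)
# "a-b" is kept whole by token_match, while "c-d" is split on the infix hyphen
print([t.text for t in tokenizer("a-b and c-d")])
```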
## Tokenizer.\_\_call\_\_ {#call tag="method"}
|
||||
|
||||
|
@ -128,6 +128,25 @@ and examples.
|
|||
| `string` | unicode | The string to specially tokenize. |
|
||||
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |
|
||||
|
||||
## Tokenizer.explain {#explain tag="method"}
|
||||
|
||||
Tokenize a string with a slow debugging tokenizer that provides information
|
||||
about which tokenizer rule or pattern was matched for each token. The tokens
|
||||
produced are identical to `Tokenizer.__call__` except for whitespace tokens.
|
||||
|
||||
> #### Example
|
||||
>
|
||||
> ```python
|
||||
> tok_exp = nlp.tokenizer.explain("(don't)")
|
||||
> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
|
||||
> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
|
||||
> ```
|
||||
|
||||
| Name | Type | Description |
|
||||
| ------------| -------- | --------------------------------------------------- |
|
||||
| `string` | unicode | The string to tokenize with the debugging tokenizer |
|
||||
| **RETURNS** | list | A list of `(pattern_string, token_string)` tuples |
|
||||
|
||||
## Tokenizer.to_disk {#to_disk tag="method"}
|
||||
|
||||
Serialize the tokenizer to disk.
|
||||
|
@ -198,12 +217,14 @@ it.
|
|||
|
||||
## Attributes {#attributes}
|
||||
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | -------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| Name | Type | Description |
|
||||
| ---------------- | ------- | --------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `vocab` | `Vocab` | The vocab object of the parent `Doc`. |
|
||||
| `prefix_search` | - | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`. |
|
||||
| `suffix_search` | - | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`. |
|
||||
| `infix_finditer` | - | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects. |
|
||||
| `token_match`   | -       | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
|
||||
| `rules` | dict | A dictionary of tokenizer exceptions and special cases. |
|
||||
|
||||
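The new `rules` property can be both read and assigned; setting it resets the cached special cases. A minimal sketch based on the `rules`-property regression test added in this commit (the `:)` entry is assumed to be present in the default English exceptions):

```python
from spacy.lang.en import English

nlp = English()
rules = nlp.tokenizer.rules        # dict of special cases, keyed by their ORTH string
rules.pop(":)", None)              # drop the emoticon special case, if present
nlp.tokenizer.rules = rules        # reassigning flushes the tokenizer's caches
doc = nlp("This is great :)")
print([t.text for t in doc])       # ":)" now splits into ":" and ")"
```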
## Serialization fields {#serialization-fields}
|
||||
|
||||
|
|
|
@ -792,6 +792,33 @@ The algorithm can be summarized as follows:
|
|||
tokens on all infixes.
|
||||
8. Once we can't consume any more of the string, handle it as a single token.
|
||||
|
||||
#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
|
||||
|
||||
A working implementation of the pseudo-code above is available for debugging as
|
||||
[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
|
||||
tuples showing which tokenizer rule or pattern was matched for each token. The
|
||||
tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
|
||||
|
||||
```python
|
||||
### {executable="true"}
|
||||
from spacy.lang.en import English
|
||||
|
||||
nlp = English()
|
||||
text = '''"Let's go!"'''
|
||||
doc = nlp(text)
|
||||
tok_exp = nlp.tokenizer.explain(text)
|
||||
assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
|
||||
for t in tok_exp:
|
||||
print(t[1], "\\t", t[0])
|
||||
|
||||
# " PREFIX
|
||||
# Let SPECIAL-1
|
||||
# 's SPECIAL-2
|
||||
# go TOKEN
|
||||
# ! SUFFIX
|
||||
# " SUFFIX
|
||||
```
|
||||
|
||||
### Customizing spaCy's Tokenizer class {#native-tokenizers}
|
||||
|
||||
Let's imagine you wanted to create a tokenizer for a new language or specific
|
||||
|
|
|
@ -1679,13 +1679,14 @@
|
|||
"slogan": "Information extraction from English and German texts based on predicate logic",
|
||||
"github": "msg-systems/holmes-extractor",
|
||||
"url": "https://github.com/msg-systems/holmes-extractor",
|
||||
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
|
||||
"description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.xt.msg.team).",
|
||||
"pip": "holmes-extractor",
|
||||
"category": ["conversational", "standalone"],
|
||||
"tags": ["chatbots", "text-processing"],
|
||||
"thumb": "https://raw.githubusercontent.com/msg-systems/holmes-extractor/master/docs/holmes_thumbnail.png",
|
||||
"code_example": [
|
||||
"import holmes_extractor as holmes",
|
||||
"holmes_manager = holmes.Manager(model='en_coref_lg')",
|
||||
"holmes_manager = holmes.Manager(model='en_core_web_lg')",
|
||||
"holmes_manager.register_search_phrase('A big dog chases a cat')",
|
||||
"holmes_manager.start_chatbot_mode_console()"
|
||||
],
|
||||
|
|