Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-05 23:06:28 +03:00)

Merge branch 'master' into spacy.io
This merge is contained in commit 02de21d8b4.

.github/contributors/GuiGel.md (vendored, 106 lines, new file)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made) will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an "x" on one of the applicable statements below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry              |
| ------------------------------ | ------------------ |
| Name                           | Guillaume Gelabert |
| Company name (if applicable)   |                    |
| Title or role (if applicable)  |                    |
| Date                           | 2019-11-15         |
| GitHub username                | GuiGel             |
| Website (optional)             |                    |

.github/contributors/erip.md (vendored, 106 lines, new file)
@@ -0,0 +1,106 @@
[Same spaCy contributor agreement text as .github/contributors/GuiGel.md above, signed as an individual, followed by these contributor details:]

## Contributor Details

| Field                          | Entry          |
| ------------------------------ | -------------- |
| Name                           | Elijah Rippeth |
| Company name (if applicable)   |                |
| Title or role (if applicable)  |                |
| Date                           | 2019-11-16     |
| GitHub username                | erip           |
| Website (optional)             |                |

@@ -50,15 +50,16 @@ jobs:
     Python36Mac:
       imageName: 'macos-10.13'
       python.version: '3.6'
-    Python37Linux:
-      imageName: 'ubuntu-16.04'
-      python.version: '3.7'
-    Python37Windows:
-      imageName: 'vs2017-win2016'
-      python.version: '3.7'
-    Python37Mac:
-      imageName: 'macos-10.13'
-      python.version: '3.7'
+    # Don't test on 3.7 for now to speed up builds
+    # Python37Linux:
+    #   imageName: 'ubuntu-16.04'
+    #   python.version: '3.7'
+    # Python37Windows:
+    #   imageName: 'vs2017-win2016'
+    #   python.version: '3.7'
+    # Python37Mac:
+    #   imageName: 'macos-10.13'
+    #   python.version: '3.7'
     Python38Linux:
       imageName: 'ubuntu-16.04'
       python.version: '3.8'

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.2.2"
+__version__ = "2.2.3"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

@@ -529,6 +529,7 @@ class Errors(object):
     E185 = ("Received invalid attribute in component attribute declaration: "
             "{obj}.{attr}\nAttribute '{attr}' does not exist on {obj}.")
     E186 = ("'{tok_a}' and '{tok_b}' are different texts.")
+    E187 = ("Only unicode strings are supported as labels.")


 @add_codes

@@ -31,6 +31,10 @@ _latin_u_supplement = r"\u00C0-\u00D6\u00D8-\u00DE"
 _latin_l_supplement = r"\u00DF-\u00F6\u00F8-\u00FF"
 _latin_supplement = r"\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF"

+_hangul_syllables = r"\uAC00-\uD7AF"
+_hangul_jamo = r"\u1100-\u11FF"
+_hangul = _hangul_syllables + _hangul_jamo
+
 # letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
 _latin_u_extendedA = (
     r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"

@@ -202,7 +206,15 @@ _upper = LATIN_UPPER + _russian_upper + _tatar_upper + _greek_upper + _ukrainian
 _lower = LATIN_LOWER + _russian_lower + _tatar_lower + _greek_lower + _ukrainian_lower

 _uncased = (
-    _bengali + _hebrew + _persian + _sinhala + _hindi + _kannada + _tamil + _telugu
+    _bengali
+    + _hebrew
+    + _persian
+    + _sinhala
+    + _hindi
+    + _kannada
+    + _tamil
+    + _telugu
+    + _hangul
 )

 ALPHA = group_chars(LATIN + _russian + _tatar + _greek + _ukrainian + _uncased)

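For orientation (not part of the commit): after this change, Hangul syllables and jamo fall inside the ALPHA character group that the tokenizer's punctuation rules are built from. A minimal sketch, assuming the spaCy 2.2.x module path spacy.lang.char_classes:

    # Sketch only: check that Hangul characters are covered by the ALPHA ranges.
    import re
    from spacy.lang.char_classes import ALPHA

    # ALPHA is a string of character ranges intended for use inside a regex class.
    alpha_re = re.compile("[{a}]".format(a=ALPHA))
    assert alpha_re.match("한") is not None  # Hangul syllable (U+D55C) now matches
    assert alpha_re.match("ᄀ") is not None  # Hangul jamo (U+1100) now matches
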
spacy/lang/ko/lex_attrs.py (67 lines, new file)
@@ -0,0 +1,67 @@
# coding: utf8
from __future__ import unicode_literals

from ...attrs import LIKE_NUM


_num_words = [
    "영",
    "공",
    # Native Korean number system
    "하나",
    "둘",
    "셋",
    "넷",
    "다섯",
    "여섯",
    "일곱",
    "여덟",
    "아홉",
    "열",
    "스물",
    "서른",
    "마흔",
    "쉰",
    "예순",
    "일흔",
    "여든",
    "아흔",
    # Sino-Korean number system
    "일",
    "이",
    "삼",
    "사",
    "오",
    "육",
    "칠",
    "팔",
    "구",
    "십",
    "백",
    "천",
    "만",
    "십만",
    "백만",
    "천만",
    "일억",
    "십억",
    "백억",
]


def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if any(char.lower() in _num_words for char in text):
        return True
    return False


LEX_ATTRS = {LIKE_NUM: like_num}

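For orientation (not part of the commit), a quick sketch of what the new Korean LIKE_NUM hook accepts; the import path is the file added above:

    from spacy.lang.ko.lex_attrs import like_num

    assert like_num("십만")    # Sino-Korean numeral ("hundred thousand")
    assert like_num("둘")      # native Korean numeral ("two")
    assert like_num("3,000")   # separators are stripped before the digit check
    assert not like_num("나무") # a word containing no numeral characters
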
@@ -6,9 +6,7 @@ from ...symbols import ORTH, LEMMA, NORM
 # TODO
 # treat other apostrophes within words as part of the word: [op d'mannst], [fir d'éischt] (= exceptions)

-_exc = {
-
-}
+_exc = {}

 # translate / delete what is not necessary
 for exc_data in [

@@ -14,6 +14,7 @@ from .tag_map import TAG_MAP
 def try_jieba_import(use_jieba):
     try:
         import jieba
+
         return jieba
     except ImportError:
         if use_jieba:

@@ -34,7 +35,9 @@ class ChineseTokenizer(DummyTokenizer):
     def __call__(self, text):
         # use jieba
         if self.use_jieba:
-            jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
+            jieba_words = list(
+                [x for x in self.jieba_seg.cut(text, cut_all=False) if x]
+            )
             words = [jieba_words[0]]
             spaces = [False]
             for i in range(1, len(jieba_words)):

@@ -292,13 +292,14 @@ class EntityRuler(object):
             self.add_patterns(patterns)
         else:
             cfg = {}
-            deserializers = {
+            deserializers_patterns = {
                 "patterns": lambda p: self.add_patterns(
                     srsly.read_jsonl(p.with_suffix(".jsonl"))
-                ),
-                "cfg": lambda p: cfg.update(srsly.read_json(p)),
+                )}
+            deserializers_cfg = {
+                "cfg": lambda p: cfg.update(srsly.read_json(p))
             }
-            from_disk(path, deserializers, {})
+            from_disk(path, deserializers_cfg, {})
             self.overwrite = cfg.get("overwrite", False)
             self.phrase_matcher_attr = cfg.get("phrase_matcher_attr")
             self.ent_id_sep = cfg.get("ent_id_sep", DEFAULT_ENT_ID_SEP)

@@ -307,6 +308,7 @@ class EntityRuler(object):
                 self.phrase_matcher = PhraseMatcher(
                     self.nlp.vocab, attr=self.phrase_matcher_attr
                 )
+            from_disk(path, deserializers_patterns, {})
         return self

     def to_disk(self, path, **kwargs):

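For orientation (not part of the commit): splitting the deserializers ensures the config (and with it phrase_matcher_attr) is read and the PhraseMatcher rebuilt before the patterns are loaded. The round trip this fixes looks roughly like the regression test added later in this commit; the on-disk path here is hypothetical:

    from spacy.lang.en import English
    from spacy.pipeline import EntityRuler

    nlp = English()
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    ruler.add_patterns([{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}])
    ruler.to_disk("/tmp/entityruler")  # hypothetical path

    # Reloading now honours the stored phrase_matcher_attr ("LOWER")
    ruler_reloaded = EntityRuler(English()).from_disk("/tmp/entityruler")
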
@@ -13,6 +13,7 @@ from thinc.misc import LayerNorm
 from thinc.neural.util import to_categorical
 from thinc.neural.util import get_array_module

+from ..compat import basestring_
 from ..tokens.doc cimport Doc
 from ..syntax.nn_parser cimport Parser
 from ..syntax.ner cimport BiluoPushDown

@@ -547,6 +548,8 @@ class Tagger(Pipe):
         return build_tagger_model(n_tags, **cfg)

     def add_label(self, label, values=None):
+        if not isinstance(label, basestring_):
+            raise ValueError(Errors.E187)
         if label in self.labels:
             return 0
         if self.model not in (True, False, None):

@@ -1016,6 +1019,8 @@ class TextCategorizer(Pipe):
         return float(mean_square_error), d_scores

     def add_label(self, label):
+        if not isinstance(label, basestring_):
+            raise ValueError(Errors.E187)
         if label in self.labels:
             return 0
         if self.model not in (None, True, False):

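For orientation (not part of the commit): with the new basestring_ check, passing a non-string label to Tagger.add_label or TextCategorizer.add_label now fails immediately with E187 instead of breaking later. Roughly, mirroring the tests added below:

    from spacy.language import Language

    nlp = Language()
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("tagger").add_label("A")    # fine: a unicode string
    try:
        nlp.get_pipe("tagger").add_label(9)  # not a string
    except ValueError as err:
        print(err)  # E187: Only unicode strings are supported as labels.
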
@@ -271,7 +271,9 @@ class Scorer(object):
                     self.labelled_per_dep[token.dep_.lower()] = PRFScore()
                 if token.dep_.lower() not in cand_deps_per_dep:
                     cand_deps_per_dep[token.dep_.lower()] = set()
-                cand_deps_per_dep[token.dep_.lower()].add((gold_i, gold_head, token.dep_.lower()))
+                cand_deps_per_dep[token.dep_.lower()].add(
+                    (gold_i, gold_head, token.dep_.lower())
+                )
         if "-" not in [token[-1] for token in gold.orig_annot]:
             # Find all NER labels in gold and doc
             ent_labels = set([x[0] for x in gold_ents] + [k.label_ for k in doc.ents])

@@ -304,7 +306,9 @@ class Scorer(object):
         self.tags.score_set(cand_tags, gold_tags)
         self.labelled.score_set(cand_deps, gold_deps)
         for dep in self.labelled_per_dep:
-            self.labelled_per_dep[dep].score_set(cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set()))
+            self.labelled_per_dep[dep].score_set(
+                cand_deps_per_dep.get(dep, set()), gold_deps_per_dep.get(dep, set())
+            )
         self.unlabelled.score_set(
             set(item[:2] for item in cand_deps), set(item[:2] for item in gold_deps)
         )

@@ -42,11 +42,17 @@ cdef WeightsC get_c_weights(model) except *:
     cdef precompute_hiddens state2vec = model.state2vec
     output.feat_weights = state2vec.get_feat_weights()
    output.feat_bias = <const float*>state2vec.bias.data
-    cdef np.ndarray vec2scores_W = model.vec2scores.W
-    cdef np.ndarray vec2scores_b = model.vec2scores.b
-    cdef np.ndarray class_mask = model._class_mask
-    output.hidden_weights = <const float*>vec2scores_W.data
-    output.hidden_bias = <const float*>vec2scores_b.data
+    cdef np.ndarray vec2scores_W
+    cdef np.ndarray vec2scores_b
+    if model.vec2scores is None:
+        output.hidden_weights = NULL
+        output.hidden_bias = NULL
+    else:
+        vec2scores_W = model.vec2scores.W
+        vec2scores_b = model.vec2scores.b
+        output.hidden_weights = <const float*>vec2scores_W.data
+        output.hidden_bias = <const float*>vec2scores_b.data
+    cdef np.ndarray class_mask = model._class_mask
     output.seen_classes = <const float*>class_mask.data
     return output

@@ -54,6 +60,9 @@ cdef WeightsC get_c_weights(model) except *:
 cdef SizesC get_c_sizes(model, int batch_size) except *:
     cdef SizesC output
     output.states = batch_size
-    output.classes = model.vec2scores.nO
+    if model.vec2scores is None:
+        output.classes = model.state2vec.nO
+    else:
+        output.classes = model.vec2scores.nO
     output.hiddens = model.state2vec.nO
     output.pieces = model.state2vec.nP

@@ -105,11 +114,12 @@ cdef void resize_activations(ActivationsC* A, SizesC n) nogil:

 cdef void predict_states(ActivationsC* A, StateC** states,
         const WeightsC* W, SizesC n) nogil:
+    cdef double one = 1.0
     resize_activations(A, n)
-    memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
-    memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
     for i in range(n.states):
         states[i].set_context_tokens(&A.token_ids[i*n.feats], n.feats)
+    memset(A.unmaxed, 0, n.states * n.hiddens * n.pieces * sizeof(float))
+    memset(A.hiddens, 0, n.states * n.hiddens * sizeof(float))
     sum_state_features(A.unmaxed,
         W.feat_weights, A.token_ids, n.states, n.feats, n.hiddens * n.pieces)
     for i in range(n.states):

@@ -120,7 +130,9 @@ cdef void predict_states(ActivationsC* A, StateC** states,
             which = Vec.arg_max(&A.unmaxed[index], n.pieces)
             A.hiddens[i*n.hiddens + j] = A.unmaxed[index + which]
     memset(A.scores, 0, n.states * n.classes * sizeof(float))
-    cdef double one = 1.0
+    if W.hidden_weights == NULL:
+        memcpy(A.scores, A.hiddens, n.states * n.classes * sizeof(float))
+    else:
         # Compute hidden-to-output
         blis.cy.gemm(blis.cy.NO_TRANSPOSE, blis.cy.TRANSPOSE,
             n.states, n.classes, n.hiddens, one,

@@ -219,7 +231,9 @@ cdef int arg_max_if_valid(const weight_t* scores, const int* is_valid, int n) no
 class ParserModel(Model):
     def __init__(self, tok2vec, lower_model, upper_model, unseen_classes=None):
         Model.__init__(self)
-        self._layers = [tok2vec, lower_model, upper_model]
+        self._layers = [tok2vec, lower_model]
+        if upper_model is not None:
+            self._layers.append(upper_model)
         self.unseen_classes = set()
         if unseen_classes:
             for class_ in unseen_classes:

@@ -234,6 +248,8 @@ class ParserModel(Model):
         return step_model, finish_parser_update

     def resize_output(self, new_output):
+        if len(self._layers) == 2:
+            return
         if new_output == self.upper.nO:
             return
         smaller = self.upper

@@ -275,11 +291,23 @@ class ParserModel(Model):
 class ParserStepModel(Model):
     def __init__(self, docs, layers, unseen_classes=None, drop=0.):
         self.tokvecs, self.bp_tokvecs = layers[0].begin_update(docs, drop=drop)
+        if layers[1].nP >= 2:
+            activation = "maxout"
+        elif len(layers) == 2:
+            activation = None
+        else:
+            activation = "relu"
         self.state2vec = precompute_hiddens(len(docs), self.tokvecs, layers[1],
-                                            drop=drop)
-        self.vec2scores = layers[-1]
-        self.cuda_stream = util.get_cuda_stream()
+                                            activation=activation, drop=drop)
+        if len(layers) == 3:
+            self.vec2scores = layers[-1]
+        else:
+            self.vec2scores = None
+        self.cuda_stream = util.get_cuda_stream(non_blocking=True)
         self.backprops = []
-        self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
+        if self.vec2scores is None:
+            self._class_mask = numpy.zeros((self.state2vec.nO,), dtype='f')
+        else:
+            self._class_mask = numpy.zeros((self.vec2scores.nO,), dtype='f')
         self._class_mask.fill(1)
         if unseen_classes is not None:

@@ -302,10 +330,15 @@ class ParserStepModel(Model):
     def begin_update(self, states, drop=0.):
         token_ids = self.get_token_ids(states)
         vector, get_d_tokvecs = self.state2vec.begin_update(token_ids, drop=0.0)
-        mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
-        if mask is not None:
-            vector *= mask
-        scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
+        if self.vec2scores is not None:
+            mask = self.vec2scores.ops.get_dropout_mask(vector.shape, drop)
+            if mask is not None:
+                vector *= mask
+            scores, get_d_vector = self.vec2scores.begin_update(vector, drop=drop)
+        else:
+            scores = NumpyOps().asarray(vector)
+            get_d_vector = lambda d_scores, sgd=None: d_scores
+            mask = None
         # If the class is unseen, make sure its score is minimum
         scores[:, self._class_mask == 0] = numpy.nanmin(scores)

@@ -342,12 +375,12 @@ class ParserStepModel(Model):
         return ids

     def make_updates(self, sgd):
-        # Tells CUDA to block, so our async copies complete.
-        if self.cuda_stream is not None:
-            self.cuda_stream.synchronize()
         # Add a padding vector to the d_tokvecs gradient, so that missing
         # values don't affect the real gradient.
         d_tokvecs = self.ops.allocate((self.tokvecs.shape[0]+1, self.tokvecs.shape[1]))
+        # Tells CUDA to block, so our async copies complete.
+        if self.cuda_stream is not None:
+            self.cuda_stream.synchronize()
         for ids, d_vector, bp_vector in self.backprops:
             d_state_features = bp_vector((d_vector, ids), sgd=sgd)
             ids = ids.flatten()

@@ -385,9 +418,10 @@ cdef class precompute_hiddens:
     cdef np.ndarray bias
     cdef object _cuda_stream
     cdef object _bp_hiddens
+    cdef object activation

     def __init__(self, batch_size, tokvecs, lower_model, cuda_stream=None,
-                 drop=0.):
+                 activation="maxout", drop=0.):
         gpu_cached, bp_features = lower_model.begin_update(tokvecs, drop=drop)
         cdef np.ndarray cached
         if not isinstance(gpu_cached, numpy.ndarray):

@@ -405,6 +439,8 @@ cdef class precompute_hiddens:
         self.nP = getattr(lower_model, 'nP', 1)
         self.nO = cached.shape[2]
         self.ops = lower_model.ops
+        assert activation in (None, "relu", "maxout")
+        self.activation = activation
         self._is_synchronized = False
         self._cuda_stream = cuda_stream
         self._cached = cached

@@ -417,7 +453,7 @@ cdef class precompute_hiddens:
         return <float*>self._cached.data

     def __call__(self, X):
-        return self.begin_update(X)[0]
+        return self.begin_update(X, drop=None)[0]

     def begin_update(self, token_ids, drop=0.):
         cdef np.ndarray state_vector = numpy.zeros(

@@ -450,28 +486,35 @@ cdef class precompute_hiddens:
         else:
             ops = CupyOps()

-        if self.nP == 1:
+        if self.activation == "maxout":
+            state_vector, mask = ops.maxout(state_vector)
+        else:
             state_vector = state_vector.reshape(state_vector.shape[:-1])
-            mask = state_vector >= 0.
-            state_vector *= mask
-        else:
-            state_vector, mask = ops.maxout(state_vector)
+            if self.activation == "relu":
+                mask = state_vector >= 0.
+                state_vector *= mask
+            else:
+                mask = None

         def backprop_nonlinearity(d_best, sgd=None):
             if isinstance(d_best, numpy.ndarray):
                 ops = NumpyOps()
             else:
                 ops = CupyOps()
-            mask_ = ops.asarray(mask)
+            if mask is not None:
+                mask_ = ops.asarray(mask)
             # This will usually be on GPU
             d_best = ops.asarray(d_best)
             # Fix nans (which can occur from unseen classes.)
             d_best[ops.xp.isnan(d_best)] = 0.
-            if self.nP == 1:
+            if self.activation == "maxout":
+                mask_ = ops.asarray(mask)
+                return ops.backprop_maxout(d_best, mask_, self.nP)
+            elif self.activation == "relu":
+                mask_ = ops.asarray(mask)
                 d_best *= mask_
                 d_best = d_best.reshape((d_best.shape + (1,)))
                 return d_best
             else:
-                return ops.backprop_maxout(d_best, mask_, self.nP)
+                return d_best.reshape((d_best.shape + (1,)))
         return state_vector, backprop_nonlinearity

@@ -100,10 +100,30 @@ cdef cppclass StateC:
         free(this.shifted - PADDING)

     void set_context_tokens(int* ids, int n) nogil:
-        if n == 2:
+        if n == 1:
+            if this.B(0) >= 0:
+                ids[0] = this.B(0)
+            else:
+                ids[0] = -1
+        elif n == 2:
             ids[0] = this.B(0)
             ids[1] = this.S(0)
-        if n == 8:
+        elif n == 3:
+            if this.B(0) >= 0:
+                ids[0] = this.B(0)
+            else:
+                ids[0] = -1
+            # First word of entity, if any
+            if this.entity_is_open():
+                ids[1] = this.E(0)
+            else:
+                ids[1] = -1
+            # Last word of entity, if within entity
+            if ids[0] == -1 or ids[1] == -1:
+                ids[2] = -1
+            else:
+                ids[2] = ids[0] - 1
+        elif n == 8:
             ids[0] = this.B(0)
             ids[1] = this.B(1)
             ids[2] = this.S(0)

@@ -22,7 +22,7 @@ from thinc.extra.search cimport Beam
 from thinc.api import chain, clone
 from thinc.v2v import Model, Maxout, Affine
 from thinc.misc import LayerNorm
-from thinc.neural.ops import CupyOps
+from thinc.neural.ops import NumpyOps, CupyOps
 from thinc.neural.util import get_array_module
 from thinc.linalg cimport Vec, VecVec
 import srsly

@@ -61,13 +61,17 @@ cdef class Parser:
         t2v_pieces = util.env_opt('cnn_maxout_pieces', cfg.get('cnn_maxout_pieces', 3))
         bilstm_depth = util.env_opt('bilstm_depth', cfg.get('bilstm_depth', 0))
         self_attn_depth = util.env_opt('self_attn_depth', cfg.get('self_attn_depth', 0))
-        if depth != 1:
+        nr_feature_tokens = cfg.get("nr_feature_tokens", cls.nr_feature)
+        if depth not in (0, 1):
             raise ValueError(TempErrors.T004.format(value=depth))
         parser_maxout_pieces = util.env_opt('parser_maxout_pieces',
                                             cfg.get('maxout_pieces', 2))
         token_vector_width = util.env_opt('token_vector_width',
                                           cfg.get('token_vector_width', 96))
         hidden_width = util.env_opt('hidden_width', cfg.get('hidden_width', 64))
+        if depth == 0:
+            hidden_width = nr_class
+            parser_maxout_pieces = 1
         embed_size = util.env_opt('embed_size', cfg.get('embed_size', 2000))
         pretrained_vectors = cfg.get('pretrained_vectors', None)
         tok2vec = Tok2Vec(token_vector_width, embed_size,

@@ -80,16 +84,19 @@ cdef class Parser:
         tok2vec = chain(tok2vec, flatten)
         tok2vec.nO = token_vector_width
         lower = PrecomputableAffine(hidden_width,
-                                    nF=cls.nr_feature, nI=token_vector_width,
+                                    nF=nr_feature_tokens, nI=token_vector_width,
                                     nP=parser_maxout_pieces)
         lower.nP = parser_maxout_pieces
-        with Model.use_device('cpu'):
-            upper = Affine(nr_class, hidden_width, drop_factor=0.0)
-        upper.W *= 0
+        if depth == 1:
+            with Model.use_device('cpu'):
+                upper = Affine(nr_class, hidden_width, drop_factor=0.0)
+            upper.W *= 0
+        else:
+            upper = None

         cfg = {
             'nr_class': nr_class,
+            'nr_feature_tokens': nr_feature_tokens,
             'hidden_depth': depth,
             'token_vector_width': token_vector_width,
             'hidden_width': hidden_width,

@@ -133,6 +140,7 @@ cdef class Parser:
         if 'beam_update_prob' not in cfg:
             cfg['beam_update_prob'] = util.env_opt('beam_update_prob', 1.0)
         cfg.setdefault('cnn_maxout_pieces', 3)
+        cfg.setdefault("nr_feature_tokens", self.nr_feature)
         self.cfg = cfg
         self.model = model
         self._multitasks = []

@@ -299,7 +307,7 @@ cdef class Parser:
         token_ids = numpy.zeros((len(docs) * beam_width, self.nr_feature),
                                 dtype='i', order='C')
         cdef int* c_ids
-        cdef int nr_feature = self.nr_feature
+        cdef int nr_feature = self.cfg["nr_feature_tokens"]
         cdef int n_states
         model = self.model(docs)
         todo = [beam for beam in beams if not beam.is_done]

@@ -502,7 +510,7 @@ cdef class Parser:
             self.moves.preprocess_gold(gold)
         model, finish_update = self.model.begin_update(docs, drop=drop)
         states_d_scores, backprops, beams = _beam_utils.update_beam(
-            self.moves, self.nr_feature, 10000, states, golds, model.state2vec,
+            self.moves, self.cfg["nr_feature_tokens"], 10000, states, golds, model.state2vec,
             model.vec2scores, width, drop=drop, losses=losses,
             beam_density=beam_density)
         for i, d_scores in enumerate(states_d_scores):

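For orientation (not part of the commit): the new "nr_feature_tokens" setting can be passed through the component config when training begins, which is exactly what the NER test added later in this commit does:

    from spacy.lang.en import English

    nlp = English()
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner)
    ner.add_label("PERSON")
    nlp.begin_training(component_cfg={"ner": {"nr_feature_tokens": 3}})
    assert ner.model.lower.nF == 3  # lower layer now uses 3 context tokens
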
@@ -2,6 +2,7 @@
 from __future__ import unicode_literals

 import pytest
+import re
 from spacy.lang.en import English
 from spacy.tokenizer import Tokenizer
 from spacy.util import compile_prefix_regex, compile_suffix_regex

@@ -19,13 +20,14 @@ def custom_en_tokenizer(en_vocab):
         r"[\[\]!&:,()\*—–\/-]",
     ]
     infix_re = compile_infix_regex(custom_infixes)
+    token_match_re = re.compile("a-b")
     return Tokenizer(
         en_vocab,
         English.Defaults.tokenizer_exceptions,
         prefix_re.search,
         suffix_re.search,
         infix_re.finditer,
-        token_match=None,
+        token_match=token_match_re.match,
     )

@@ -74,3 +76,81 @@ def test_en_customized_tokenizer_handles_infixes(custom_en_tokenizer):
         "Megaregion",
         ".",
     ]
+
+
+def test_en_customized_tokenizer_handles_token_match(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions a-b not used for the greater Southern California Megaregion."
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "a-b",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+    ]
+
+
+def test_en_customized_tokenizer_handles_rules(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "are",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+        ":)",
+    ]
+
+
+def test_en_customized_tokenizer_handles_rules_property(custom_en_tokenizer):
+    sentence = "The 8 and 10-county definitions are not used for the greater Southern California Megaregion. :)"
+    rules = custom_en_tokenizer.rules
+    del rules[":)"]
+    custom_en_tokenizer.rules = rules
+    context = [word.text for word in custom_en_tokenizer(sentence)]
+    assert context == [
+        "The",
+        "8",
+        "and",
+        "10",
+        "-",
+        "county",
+        "definitions",
+        "are",
+        "not",
+        "used",
+        "for",
+        "the",
+        "greater",
+        "Southern",
+        "California",
+        "Megaregion",
+        ".",
+        ":",
+        ")",
+    ]

@@ -259,6 +259,27 @@ def test_block_ner():
     assert [token.ent_type_ for token in doc] == expected_types


+def test_change_number_features():
+    # Test the default number features
+    nlp = English()
+    ner = nlp.create_pipe("ner")
+    nlp.add_pipe(ner)
+    ner.add_label("PERSON")
+    nlp.begin_training()
+    assert ner.model.lower.nF == ner.nr_feature
+    # Test we can change it
+    nlp = English()
+    ner = nlp.create_pipe("ner")
+    nlp.add_pipe(ner)
+    ner.add_label("PERSON")
+    nlp.begin_training(
+        component_cfg={"ner": {"nr_feature_tokens": 3, "token_vector_width": 128}}
+    )
+    assert ner.model.lower.nF == 3
+    # Test the model runs
+    nlp("hello world")
+
+
 class BlockerComponent1(object):
     name = "my_blocker"

spacy/tests/pipeline/test_tagger.py (14 lines, new file)
@@ -0,0 +1,14 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy.language import Language
from spacy.pipeline import Tagger


def test_label_types():
    nlp = Language()
    nlp.add_pipe(nlp.create_pipe("tagger"))
    nlp.get_pipe("tagger").add_label("A")
    with pytest.raises(ValueError):
        nlp.get_pipe("tagger").add_label(9)

@@ -62,3 +62,11 @@ def test_textcat_learns_multilabel():
             assert score < 0.5
         else:
             assert score > 0.5
+
+
+def test_label_types():
+    nlp = Language()
+    nlp.add_pipe(nlp.create_pipe("textcat"))
+    nlp.get_pipe("textcat").add_label("answer")
+    with pytest.raises(ValueError):
+        nlp.get_pipe("textcat").add_label(9)

@@ -3,9 +3,9 @@ from __future__ import unicode_literals

 import srsly
 from spacy.gold import GoldCorpus

 from spacy.lang.en import English
-from spacy.tests.util import make_tempdir
+from ..util import make_tempdir


 def test_issue4402():

@@ -1,7 +1,6 @@
 # coding: utf-8
 from __future__ import unicode_literals

-import pytest
 from mock import Mock
 from spacy.matcher import DependencyMatcher
 from ..util import get_doc

@@ -11,8 +10,14 @@ def test_issue4590(en_vocab):
     """Test that matches param in on_match method are the same as matches run with no on_match method"""
     pattern = [
         {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
-        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
-        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
+        {
+            "SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
+        {
+            "SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"},
+            "PATTERN": {"ORTH": "fox"},
+        },
     ]

     on_match = Mock()

@@ -31,4 +36,3 @@ def test_issue4590(en_vocab):
     on_match_args = on_match.call_args

     assert on_match_args[0][3] == matches
-

spacy/tests/regression/test_issue4651.py (65 lines, new file)
@@ -0,0 +1,65 @@
# coding: utf-8
from __future__ import unicode_literals

from spacy.lang.en import English
from spacy.pipeline import EntityRuler

from ..util import make_tempdir


def test_issue4651_with_phrase_matcher_attr():
    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
    the method from_disk when the EntityRuler argument phrase_matcher_attr is
    specified.
    """
    text = "Spacy is a python library for nlp"

    nlp = English()
    ruler = EntityRuler(nlp, phrase_matcher_attr="LOWER")
    patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    doc = nlp(text)
    res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]

    nlp_reloaded = English()
    with make_tempdir() as d:
        file_path = d / "entityruler"
        ruler.to_disk(file_path)
        ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)

    nlp_reloaded.add_pipe(ruler_reloaded)
    doc_reloaded = nlp_reloaded(text)
    res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]

    assert res == res_reloaded


def test_issue4651_without_phrase_matcher_attr():
    """Test that the EntityRuler PhraseMatcher is deserialized correctly using
    the method from_disk when the EntityRuler argument phrase_matcher_attr is
    not specified.
    """
    text = "Spacy is a python library for nlp"

    nlp = English()
    ruler = EntityRuler(nlp)
    patterns = [{"label": "PYTHON_LIB", "pattern": "spacy", "id": "spaCy"}]
    ruler.add_patterns(patterns)
    nlp.add_pipe(ruler)

    doc = nlp(text)
    res = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]

    nlp_reloaded = English()
    with make_tempdir() as d:
        file_path = d / "entityruler"
        ruler.to_disk(file_path)
        ruler_reloaded = EntityRuler(nlp_reloaded).from_disk(file_path)

    nlp_reloaded.add_pipe(ruler_reloaded)
    doc_reloaded = nlp_reloaded(text)
    res_reloaded = [(ent.text, ent.label_, ent.ent_id_) for ent in doc_reloaded.ents]

    assert res == res_reloaded

@@ -12,8 +12,22 @@ from .util import get_doc
 test_las_apple = [
     [
         "Apple is looking at buying U.K. startup for $ 1 billion",
-        {"heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
-         "deps": ['nsubj', 'aux', 'ROOT', 'prep', 'pcomp', 'compound', 'dobj', 'prep', 'quantmod', 'compound', 'pobj']},
+        {
+            "heads": [2, 2, 2, 2, 3, 6, 4, 4, 10, 10, 7],
+            "deps": [
+                "nsubj",
+                "aux",
+                "ROOT",
+                "prep",
+                "pcomp",
+                "compound",
+                "dobj",
+                "prep",
+                "quantmod",
+                "compound",
+                "pobj",
+            ],
+        },
     ]
 ]

@@ -59,7 +73,7 @@ def test_las_per_type(en_vocab):
         en_vocab,
         words=input_.split(" "),
         heads=([h - i for i, h in enumerate(annot["heads"])]),
-        deps=annot["deps"]
+        deps=annot["deps"],
     )
     gold = GoldParse(doc, heads=annot["heads"], deps=annot["deps"])
     doc[0].dep_ = "compound"

spacy/tests/tokenizer/test_explain.py (65 lines, new file)
@@ -0,0 +1,65 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from spacy.util import get_lang_class

# Only include languages with no external dependencies
# "is" seems to confuse importlib, so we're also excluding it for now
# excluded: ja, ru, th, uk, vi, zh, is
LANGUAGES = [
    pytest.param("fr", marks=pytest.mark.slow()),
    pytest.param("af", marks=pytest.mark.slow()),
    pytest.param("ar", marks=pytest.mark.slow()),
    pytest.param("bg", marks=pytest.mark.slow()),
    "bn",
    pytest.param("ca", marks=pytest.mark.slow()),
    pytest.param("cs", marks=pytest.mark.slow()),
    pytest.param("da", marks=pytest.mark.slow()),
    pytest.param("de", marks=pytest.mark.slow()),
    "el",
    "en",
    pytest.param("es", marks=pytest.mark.slow()),
    pytest.param("et", marks=pytest.mark.slow()),
    pytest.param("fa", marks=pytest.mark.slow()),
    pytest.param("fi", marks=pytest.mark.slow()),
    "fr",
    pytest.param("ga", marks=pytest.mark.slow()),
    pytest.param("he", marks=pytest.mark.slow()),
    pytest.param("hi", marks=pytest.mark.slow()),
    pytest.param("hr", marks=pytest.mark.slow()),
    "hu",
    pytest.param("id", marks=pytest.mark.slow()),
    pytest.param("it", marks=pytest.mark.slow()),
    pytest.param("kn", marks=pytest.mark.slow()),
    pytest.param("lb", marks=pytest.mark.slow()),
    pytest.param("lt", marks=pytest.mark.slow()),
    pytest.param("lv", marks=pytest.mark.slow()),
    pytest.param("nb", marks=pytest.mark.slow()),
    pytest.param("nl", marks=pytest.mark.slow()),
    "pl",
    pytest.param("pt", marks=pytest.mark.slow()),
    pytest.param("ro", marks=pytest.mark.slow()),
    pytest.param("si", marks=pytest.mark.slow()),
    pytest.param("sk", marks=pytest.mark.slow()),
    pytest.param("sl", marks=pytest.mark.slow()),
    pytest.param("sq", marks=pytest.mark.slow()),
    pytest.param("sr", marks=pytest.mark.slow()),
    pytest.param("sv", marks=pytest.mark.slow()),
    pytest.param("ta", marks=pytest.mark.slow()),
    pytest.param("te", marks=pytest.mark.slow()),
    pytest.param("tl", marks=pytest.mark.slow()),
    pytest.param("tr", marks=pytest.mark.slow()),
    pytest.param("tt", marks=pytest.mark.slow()),
    pytest.param("ur", marks=pytest.mark.slow()),
]


@pytest.mark.parametrize("lang", LANGUAGES)
def test_tokenizer_explain(lang):
    tokenizer = get_lang_class(lang).Defaults.create_tokenizer()
    examples = pytest.importorskip("spacy.lang.{}.examples".format(lang))
    for sentence in examples.sentences:
        tokens = [t.text for t in tokenizer(sentence) if not t.is_space]
        debug_tokens = [t[1] for t in tokenizer.explain(sentence)]
        assert tokens == debug_tokens

@@ -15,6 +15,8 @@ import re
from .tokens.doc cimport Doc
from .strings cimport hash_string
from .compat import unescape_unicode
+from .attrs import intify_attrs
+from .symbols import ORTH

from .errors import Errors, Warnings, deprecation_warning
from . import util

@@ -57,9 +59,7 @@ cdef class Tokenizer:
        self.infix_finditer = infix_finditer
        self.vocab = vocab
        self._rules = {}
-        if rules is not None:
-            for chunk, substrings in sorted(rules.items()):
-                self.add_special_case(chunk, substrings)
+        self._load_special_tokenization(rules)

    property token_match:
        def __get__(self):

@@ -93,6 +93,18 @@ cdef class Tokenizer:
            self._infix_finditer = infix_finditer
            self._flush_cache()

+    property rules:
+        def __get__(self):
+            return self._rules
+
+        def __set__(self, rules):
+            self._rules = {}
+            self._reset_cache([key for key in self._cache])
+            self._reset_specials()
+            self._cache = PreshMap()
+            self._specials = PreshMap()
+            self._load_special_tokenization(rules)
+
    def __reduce__(self):
        args = (self.vocab,
                self._rules,

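For context, the `rules` property added above makes the special cases both readable and replaceable at run time. A minimal usage sketch (the `dont` special case is an illustrative assumption, not part of this commit):

```python
from spacy.lang.en import English
from spacy.symbols import ORTH

nlp = English()
# Reading the property returns the tokenizer's current special-case rules.
assert len(nlp.tokenizer.rules) > 0
# Assigning replaces all special cases; per __set__ above, the token cache and
# specials map are flushed before the new rules are loaded.
nlp.tokenizer.rules = {"dont": [{ORTH: "do"}, {ORTH: "nt"}]}
assert "dont" in nlp.tokenizer.rules
assert [t.text for t in nlp.tokenizer("dont")] == ["do", "nt"]
```
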
@@ -227,10 +239,6 @@ cdef class Tokenizer:
        cdef unicode minus_suf
        cdef size_t last_size = 0
        while string and len(string) != last_size:
-            if self.token_match and self.token_match(string) \
-                    and not self.find_prefix(string) \
-                    and not self.find_suffix(string):
-                break
            if self._specials.get(hash_string(string)) != NULL:
                has_special[0] = 1
                break

@@ -393,6 +401,7 @@ cdef class Tokenizer:

    def _load_special_tokenization(self, special_cases):
        """Add special-case tokenization rules."""
+        if special_cases is not None:
            for chunk, substrings in sorted(special_cases.items()):
                self.add_special_case(chunk, substrings)

@@ -423,6 +432,73 @@ cdef class Tokenizer:
        self.mem.free(stale_cached)
        self._rules[string] = substrings

+    def explain(self, text):
+        """A debugging tokenizer that provides information about which
+        tokenizer rule or pattern was matched for each token. The tokens
+        produced are identical to `nlp.tokenizer()` except for whitespace
+        tokens.
+
+        string (unicode): The string to tokenize.
+        RETURNS (list): A list of (pattern_string, token_string) tuples
+
+        DOCS: https://spacy.io/api/tokenizer#explain
+        """
+        prefix_search = self.prefix_search
+        suffix_search = self.suffix_search
+        infix_finditer = self.infix_finditer
+        token_match = self.token_match
+        special_cases = {}
+        for orth, special_tokens in self.rules.items():
+            special_cases[orth] = [intify_attrs(special_token, strings_map=self.vocab.strings, _do_deprecated=True) for special_token in special_tokens]
+        tokens = []
+        for substring in text.split():
+            suffixes = []
+            while substring:
+                while prefix_search(substring) or suffix_search(substring):
+                    if substring in special_cases:
+                        tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
+                        substring = ''
+                        break
+                    if prefix_search(substring):
+                        split = prefix_search(substring).end()
+                        # break if pattern matches the empty string
+                        if split == 0:
+                            break
+                        tokens.append(("PREFIX", substring[:split]))
+                        substring = substring[split:]
+                        if substring in special_cases:
+                            continue
+                    if suffix_search(substring):
+                        split = suffix_search(substring).start()
+                        # break if pattern matches the empty string
+                        if split == len(substring):
+                            break
+                        suffixes.append(("SUFFIX", substring[split:]))
+                        substring = substring[:split]
+                if substring in special_cases:
+                    tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
+                    substring = ''
+                elif token_match(substring):
+                    tokens.append(("TOKEN_MATCH", substring))
+                    substring = ''
+                elif list(infix_finditer(substring)):
+                    infixes = infix_finditer(substring)
+                    offset = 0
+                    for match in infixes:
+                        if substring[offset : match.start()]:
+                            tokens.append(("TOKEN", substring[offset : match.start()]))
+                        if substring[match.start() : match.end()]:
+                            tokens.append(("INFIX", substring[match.start() : match.end()]))
+                        offset = match.end()
+                    if substring[offset:]:
+                        tokens.append(("TOKEN", substring[offset:]))
+                    substring = ''
+                elif substring:
+                    tokens.append(("TOKEN", substring))
+                    substring = ''
+            tokens.extend(reversed(suffixes))
+        return tokens
+
    def to_disk(self, path, **kwargs):
        """Save the current state to a directory.

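For orientation, a minimal usage sketch of the `explain` method added above, mirroring the example this commit adds to the API docs further down:

```python
from spacy.lang.en import English

nlp = English()
# Each tuple pairs the matched rule or pattern name with the token text.
tok_exp = nlp.tokenizer.explain("(don't)")
assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
```
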
@@ -507,8 +583,7 @@ cdef class Tokenizer:
        self._reset_specials()
        self._cache = PreshMap()
        self._specials = PreshMap()
-        for string, substrings in data.get("rules", {}).items():
-            self.add_special_case(string, substrings)
+        self._load_special_tokenization(data.get("rules", {}))

        return self

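Because deserialization now routes through `_load_special_tokenization`, special-case rules survive a save/load round trip. A small sketch (the temporary-file handling is illustrative, not from this commit):

```python
import tempfile
from pathlib import Path

from spacy.lang.en import English

nlp = English()
assert len(nlp.tokenizer.rules) > 0  # English ships with tokenizer exceptions

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokenizer"
    nlp.tokenizer.to_disk(path)    # serializes rules along with the affix patterns
    nlp.tokenizer.from_disk(path)  # reloads them via _load_special_tokenization

assert len(nlp.tokenizer.rules) > 0
```
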
@@ -301,13 +301,13 @@ def get_component_name(component):
    return repr(component)


-def get_cuda_stream(require=False):
+def get_cuda_stream(require=False, non_blocking=True):
    if CudaStream is None:
        return None
    elif isinstance(Model.ops, NumpyOps):
        return None
    else:
-        return CudaStream()
+        return CudaStream(non_blocking=non_blocking)


def get_async(stream, numpy_array):

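A quick sketch of the updated helper, assuming it lives in `spacy.util` as the surrounding `get_component_name`/`get_async` context suggests:

```python
from spacy.util import get_cuda_stream

# Still returns None on CPU-only setups (NumpyOps or no CuPy); on GPU the
# stream is now created with non_blocking=True unless overridden.
stream = get_cuda_stream()
blocking_stream = get_cuda_stream(non_blocking=False)
print(stream, blocking_stream)
```
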
@@ -265,16 +265,11 @@ cdef class Vectors:
            rows = [self.key2row.get(key, -1.) for key in keys]
            return xp.asarray(rows, dtype="i")
        else:
-            targets = set()
+            row2key = {row: key for key, row in self.key2row.items()}
            if row is not None:
-                targets.add(row)
+                return row2key[row]
            else:
-                targets.update(rows)
-                results = []
-                for key, row in self.key2row.items():
-                    if row in targets:
-                        results.append(key)
-                        targets.remove(row)
+                results = [row2key[row] for row in rows]
                return xp.asarray(results, dtype="uint64")

    def add(self, key, *, vector=None, row=None):

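The rewritten branch builds a reverse `row2key` mapping once instead of scanning `key2row` for every target row. A rough sketch of the lookups it serves (the toy vector data is an assumption for illustration):

```python
import numpy
from spacy.vectors import Vectors

vectors = Vectors(shape=(2, 4))
vectors.add("cat", vector=numpy.asarray([1.0, 2.0, 3.0, 4.0], dtype="f"))

row = vectors.find(key="cat")         # key -> row index
key = vectors.find(row=int(row))      # row -> hash key, now a direct dict lookup
keys = vectors.find(rows=[int(row)])  # rows -> array of hash keys
assert keys[0] == key
```
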
@@ -58,4 +58,5 @@ Update the evaluation scores from a single [`Doc`](/api/doc) /
| `ents_per_type` <Tag variant="new">2.1.5</Tag> | dict | Scores per entity label. Keyed by label, mapped to a dict of `p`, `r` and `f` scores. |
| `textcat_score` <Tag variant="new">2.2</Tag> | float | F-score on positive label for binary exclusive, macro-averaged F-score for 3+ exclusive, macro-averaged AUC ROC score for multilabel (`-1` if undefined). |
| `textcats_per_cat` <Tag variant="new">2.2</Tag> | dict | Scores per textcat label, keyed by label. |
+| `las_per_type` <Tag variant="new">2.2.3</Tag> | dict | Labelled dependency scores, keyed by label. |
| `scores` | dict | All scores, keyed by type. |

@@ -35,13 +35,13 @@ the
> ```

| Name             | Type        | Description                                                                             |
-| ---------------- | ----------- | ------------------------------------------------------------------------------------ |
+| ---------------- | ----------- | --------------------------------------------------------------------------------------- |
| `vocab`          | `Vocab`     | A storage container for lexical types.                                                  |
| `rules`          | dict        | Exceptions and special-cases for the tokenizer.                                         |
| `prefix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match prefixes.    |
| `suffix_search`  | callable    | A function matching the signature of `re.compile(string).search` to match suffixes.    |
| `infix_finditer` | callable    | A function matching the signature of `re.compile(string).finditer` to find infixes.    |
-| `token_match`    | callable    | A boolean function matching strings to be recognized as tokens.                        |
+| `token_match`    | callable    | A function matching the signature of `re.compile(string).match` to find token matches. |
| **RETURNS**      | `Tokenizer` | The newly constructed object.                                                           |

## Tokenizer.\_\_call\_\_ {#call tag="method"}

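As a hedged illustration of the `token_match` signature described in the row above (the hyphen infix and the regexes are assumptions for the sketch, not part of this commit):

```python
import re

from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
infix_re = re.compile(r"-")
hyphen_word_re = re.compile(r"^[A-Za-z]+(-[A-Za-z]+)+$")

# token_match is checked before infixes, so fully hyphenated words stay whole
# even though the infix pattern would otherwise split on every hyphen.
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    infix_finditer=infix_re.finditer,
    token_match=hyphen_word_re.match,
)
doc = nlp("state-of-the-art results")
assert [t.text for t in doc] == ["state-of-the-art", "results"]
```
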
@@ -128,6 +128,25 @@ and examples.
| `string`      | unicode  | The string to specially tokenize.                                                                                                                                         |
| `token_attrs` | iterable | A sequence of dicts, where each dict describes a token and its attributes. The `ORTH` fields of the attributes must exactly match the string when they are concatenated. |

+## Tokenizer.explain {#explain tag="method"}
+
+Tokenize a string with a slow debugging tokenizer that provides information
+about which tokenizer rule or pattern was matched for each token. The tokens
+produced are identical to `Tokenizer.__call__` except for whitespace tokens.
+
+> #### Example
+>
+> ```python
+> tok_exp = nlp.tokenizer.explain("(don't)")
+> assert [t[0] for t in tok_exp] == ["PREFIX", "SPECIAL-1", "SPECIAL-2", "SUFFIX"]
+> assert [t[1] for t in tok_exp] == ["(", "do", "n't", ")"]
+> ```
+
+| Name        | Type    | Description                                           |
+| ----------- | ------- | ----------------------------------------------------- |
+| `string`    | unicode | The string to tokenize with the debugging tokenizer.  |
+| **RETURNS** | list    | A list of `(pattern_string, token_string)` tuples.    |
+
## Tokenizer.to_disk {#to_disk tag="method"}

Serialize the tokenizer to disk.

@@ -199,11 +218,13 @@ it.
## Attributes {#attributes}

| Name             | Type    | Description                                                                                                                    |
-| ---------------- | ------- | ---------------------------------------------------------------------------------------------------------------------------- |
+| ---------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `vocab`          | `Vocab` | The vocab object of the parent `Doc`.                                                                                          |
| `prefix_search`  | -       | A function to find segment boundaries from the start of a string. Returns the length of the segment, or `None`.               |
| `suffix_search`  | -       | A function to find segment boundaries from the end of a string. Returns the length of the segment, or `None`.                 |
| `infix_finditer` | -       | A function to find internal segment separators, e.g. hyphens. Returns a (possibly empty) list of `re.MatchObject` objects.    |
+| `token_match`    | -       | A function matching the signature of `re.compile(string).match` to find token matches. Returns an `re.MatchObject` or `None`. |
+| `rules`          | dict    | A dictionary of tokenizer exceptions and special cases.                                                                        |

## Serialization fields {#serialization-fields}

@@ -792,6 +792,33 @@ The algorithm can be summarized as follows:
   tokens on all infixes.
8. Once we can't consume any more of the string, handle it as a single token.

+#### Debugging the tokenizer {#tokenizer-debug new="2.2.3"}
+
+A working implementation of the pseudo-code above is available for debugging as
+[`nlp.tokenizer.explain(text)`](/api/tokenizer#explain). It returns a list of
+tuples showing which tokenizer rule or pattern was matched for each token. The
+tokens produced are identical to `nlp.tokenizer()` except for whitespace tokens:
+
+```python
+### {executable="true"}
+from spacy.lang.en import English
+
+nlp = English()
+text = '''"Let's go!"'''
+doc = nlp(text)
+tok_exp = nlp.tokenizer.explain(text)
+assert [t.text for t in doc if not t.is_space] == [t[1] for t in tok_exp]
+for t in tok_exp:
+    print(t[1], "\\t", t[0])
+
+# " PREFIX
+# Let SPECIAL-1
+# 's SPECIAL-2
+# go TOKEN
+# ! SUFFIX
+# " SUFFIX
+```
+
### Customizing spaCy's Tokenizer class {#native-tokenizers}

Let's imagine you wanted to create a tokenizer for a new language or specific

@@ -1679,13 +1679,14 @@
      "slogan": "Information extraction from English and German texts based on predicate logic",
      "github": "msg-systems/holmes-extractor",
      "url": "https://github.com/msg-systems/holmes-extractor",
-      "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural search, topic matching and supervised document classification.",
+      "description": "Holmes is a Python 3 library that supports a number of use cases involving information extraction from English and German texts, including chatbot, structural extraction, topic matching and supervised document classification. There is a [website demonstrating intelligent search based on topic matching](https://holmes-demo.xt.msg.team).",
      "pip": "holmes-extractor",
      "category": ["conversational", "standalone"],
      "tags": ["chatbots", "text-processing"],
+      "thumb": "https://raw.githubusercontent.com/msg-systems/holmes-extractor/master/docs/holmes_thumbnail.png",
      "code_example": [
        "import holmes_extractor as holmes",
-        "holmes_manager = holmes.Manager(model='en_coref_lg')",
+        "holmes_manager = holmes.Manager(model='en_core_web_lg')",
        "holmes_manager.register_search_phrase('A big dog chases a cat')",
        "holmes_manager.start_chatbot_mode_console()"
      ],