Merge pull request #5755 from adrianeboyd/v2.3.x

Update v2.3.x from master
Adriane Boyd 2020-07-13 15:30:40 +02:00 committed by GitHub
commit bf778f59c7
17 changed files with 572 additions and 124 deletions

.github/contributors/gandersen101.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Grant Andersen |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 07.06.2020 |
| GitHub username | gandersen101 |
| Website (optional) | |

.github/contributors/jbesomi.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Jonathan B. |
| Company name (if applicable) | besomi.ai |
| Title or role (if applicable) | - |
| Date | 07.07.2020 |
| GitHub username | jbesomi |
| Website (optional) | besomi.ai |

.github/contributors/mikeizbicki.md (new file, 106 lines)

# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Mike Izbicki |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 02 Jun 2020 |
| GitHub username | mikeizbicki |
| Website (optional) | https://izbicki.me |

View File

@@ -14,7 +14,7 @@ from thinc.api import with_getitem, flatten_add_lengths
 from thinc.api import uniqued, wrap, noop
 from thinc.linear.linear import LinearModel
 from thinc.neural.ops import NumpyOps, CupyOps
-from thinc.neural.util import get_array_module, copy_array
+from thinc.neural.util import get_array_module, copy_array, to_categorical
 from thinc.neural.optimizers import Adam
 from thinc import describe
@@ -840,6 +840,8 @@ def masked_language_model(vocab, model, mask_prob=0.15):
         def mlm_backward(d_output, sgd=None):
             d_output *= 1 - mask
+            # Rescale gradient for number of instances.
+            d_output *= mask.size - mask.sum()
             return backprop(d_output, sgd=sgd)
         return output, mlm_backward
@@ -944,7 +946,7 @@ class CharacterEmbed(Model):
         # for the tip.
         nCv = self.ops.xp.arange(self.nC)
         for doc in docs:
-            doc_ids = doc.to_utf8_array(nr_char=self.nC)
+            doc_ids = self.ops.asarray(doc.to_utf8_array(nr_char=self.nC))
             doc_vectors = self.ops.allocate((len(doc), self.nC, self.nM))
             # Let's say I have a 2d array of indices, and a 3d table of data. What numpy
             # incantation do I chant to get
@@ -986,3 +988,17 @@ def get_cossim_loss(yh, y, ignore_zeros=False):
         losses[zero_indices] = 0
     loss = losses.sum()
     return loss, -d_yh
+
+
+def get_characters_loss(ops, docs, prediction, nr_char=10):
+    target_ids = numpy.vstack([doc.to_utf8_array(nr_char=nr_char) for doc in docs])
+    target_ids = target_ids.reshape((-1,))
+    target = ops.asarray(to_categorical(target_ids, nb_classes=256), dtype="f")
+    target = target.reshape((-1, 256*nr_char))
+    diff = prediction - target
+    loss = (diff**2).sum()
+    d_target = diff / float(prediction.shape[0])
+    return loss, d_target
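
Note: the new get_characters_loss is a squared-error loss against a one-hot encoding of each token's leading/trailing UTF-8 bytes. A minimal numpy sketch of the same computation, with illustrative names that are not part of spaCy's API:

import numpy

def characters_loss_sketch(prediction, target_ids, nr_char=10):
    # target_ids: (n_tokens, nr_char) array of UTF-8 byte values (0-255)
    # prediction: (n_tokens, 256 * nr_char) array of predicted byte scores
    one_hot = numpy.eye(256, dtype="f")[target_ids.reshape(-1)]
    target = one_hot.reshape((-1, 256 * nr_char))
    diff = prediction - target
    loss = (diff ** 2).sum()
    d_prediction = diff / float(prediction.shape[0])  # gradient w.r.t. the prediction
    return loss, d_prediction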

View File

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy"
-__version__ = "2.3.1"
+__version__ = "2.3.2"
 __release__ = True
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"

View File

@@ -18,7 +18,8 @@ from ..errors import Errors
 from ..tokens import Doc
 from ..attrs import ID, HEAD
 from .._ml import Tok2Vec, flatten, chain, create_default_optimizer
-from .._ml import masked_language_model, get_cossim_loss
+from .._ml import masked_language_model, get_cossim_loss, get_characters_loss
+from .._ml import MultiSoftmax
 from .. import util
 from .train import _load_pretrained_tok2vec
@@ -42,7 +43,7 @@ from .train import _load_pretrained_tok2vec
    bilstm_depth=("Depth of BiLSTM layers (requires PyTorch)", "option", "lstm", int),
    embed_rows=("Number of embedding rows", "option", "er", int),
    loss_func=(
-        "Loss function to use for the objective. Either 'L2' or 'cosine'",
+        "Loss function to use for the objective. Either 'characters', 'L2' or 'cosine'",
        "option",
        "L",
        str,
@@ -85,11 +86,11 @@ def pretrain(
    output_dir,
    width=96,
    conv_depth=4,
-    bilstm_depth=0,
    cnn_pieces=3,
    sa_depth=0,
-    use_chars=False,
    cnn_window=1,
+    bilstm_depth=0,
+    use_chars=False,
    embed_rows=2000,
    loss_func="cosine",
    use_vectors=False,
@@ -124,11 +125,7 @@ def pretrain(
        config[key] = str(config[key])
    util.fix_random_seed(seed)
-    has_gpu = prefer_gpu()
-    if has_gpu:
-        import torch
-        torch.set_default_tensor_type("torch.cuda.FloatTensor")
+    has_gpu = prefer_gpu(gpu_id=1)
    msg.info("Using GPU" if has_gpu else "Not using GPU")
    output_dir = Path(output_dir)
@@ -174,6 +171,7 @@ def pretrain(
            subword_features=not use_chars,  # Set to False for Chinese etc
            cnn_maxout_pieces=cnn_pieces,  # If set to 1, use Mish activation.
        ),
+        objective=loss_func
    )
    # Load in pretrained weights
    if init_tok2vec is not None:
@@ -264,7 +262,10 @@ def make_update(model, docs, optimizer, drop=0.0, objective="L2"):
    RETURNS loss: A float for the loss.
    """
    predictions, backprop = model.begin_update(docs, drop=drop)
-    loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
+    if objective == "characters":
+        loss, gradients = get_characters_loss(model.ops, docs, predictions)
+    else:
+        loss, gradients = get_vectors_loss(model.ops, docs, predictions, objective)
    backprop(gradients, sgd=optimizer)
    # Don't want to return a cupy object here
    # The gradients are modified in-place by the BERT MLM,
@@ -326,16 +327,23 @@ def get_vectors_loss(ops, docs, prediction, objective="L2"):
    return loss, d_target

-def create_pretraining_model(nlp, tok2vec):
+def create_pretraining_model(nlp, tok2vec, objective="cosine", nr_char=10):
    """Define a network for the pretraining. We simply add an output layer onto
    the tok2vec input model. The tok2vec input model needs to be a model that
    takes a batch of Doc objects (as a list), and returns a list of arrays.
    Each array in the output needs to have one row per token in the doc.
    """
-    output_size = nlp.vocab.vectors.data.shape[1]
-    output_layer = chain(
-        LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
-    )
+    if objective == "characters":
+        out_sizes = [256] * nr_char
+        output_layer = chain(
+            LN(Maxout(300, pieces=3)),
+            MultiSoftmax(out_sizes, 300)
+        )
+    else:
+        output_size = nlp.vocab.vectors.data.shape[1]
+        output_layer = chain(
+            LN(Maxout(300, pieces=3)), Affine(output_size, drop_factor=0.0)
+        )
    # This is annoying, but the parser etc have the flatten step after
    # the tok2vec. To load the weights in cleanly, we need to match
    # the shape of the models' components exactly. So what we cann
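
Usage sketch for the new objective, calling the pretrain entry point from Python instead of the command line; the corpus path, vectors package and output directory below are placeholders:

from spacy.cli import pretrain

# Roughly equivalent to:
#   python -m spacy pretrain texts.jsonl en_vectors_web_lg ./pretrain-out --loss-func characters --use-chars
pretrain(
    "texts.jsonl",        # raw-text corpus, one JSON object per line
    "en_vectors_web_lg",  # model providing the vocab; its vectors are not the target here
    "./pretrain-out",     # directory that receives the model*.bin weights
    loss_func="characters",
    use_chars=True,
)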

View File

@@ -285,7 +285,7 @@ def train(
    if base_model and not pipes_added:
        # Start with an existing model, use default optimizer
-        optimizer = create_default_optimizer(Model.ops)
+        optimizer = nlp.resume_training(device=use_gpu)
    else:
        # Start with a blank model, call begin_training
        cfg = {"device": use_gpu}
@@ -576,6 +576,8 @@ def train(
        with nlp.use_params(optimizer.averages):
            final_model_path = output_path / "model-final"
            nlp.to_disk(final_model_path)
+            srsly.write_json(final_model_path / "meta.json", meta)
        meta_loc = output_path / "model-final" / "meta.json"
        final_meta = srsly.read_json(meta_loc)
        final_meta.setdefault("accuracy", {})
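
The first hunk means that training started from a base model now goes through nlp.resume_training, which prepares the existing pipeline for further updates and returns an optimizer. A rough sketch, assuming a small English model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")          # the base model passed via --base-model
optimizer = nlp.resume_training(device=-1)  # keep the existing weights, get an optimizer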

View File

@@ -18,41 +18,6 @@ def _return_en(_):
     return "en"

-def en_is_base_form(univ_pos, morphology=None):
-    """
-    Check whether we're dealing with an uninflected paradigm, so we can
-    avoid lemmatization entirely.
-
-    univ_pos (unicode / int): The token's universal part-of-speech tag.
-    morphology (dict): The token's morphological features following the
-        Universal Dependencies scheme.
-    """
-    if morphology is None:
-        morphology = {}
-    if univ_pos == "noun" and morphology.get("Number") == "sing":
-        return True
-    elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
-        return True
-    # This maps 'VBP' to base form -- probably just need 'IS_BASE'
-    # morphology
-    elif univ_pos == "verb" and (
-        morphology.get("VerbForm") == "fin"
-        and morphology.get("Tense") == "pres"
-        and morphology.get("Number") is None
-    ):
-        return True
-    elif univ_pos == "adj" and morphology.get("Degree") == "pos":
-        return True
-    elif morphology.get("VerbForm") == "inf":
-        return True
-    elif morphology.get("VerbForm") == "none":
-        return True
-    elif morphology.get("Degree") == "pos":
-        return True
-    else:
-        return False
-
 class EnglishDefaults(Language.Defaults):
     lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
     lex_attr_getters.update(LEX_ATTRS)
@@ -61,7 +26,6 @@ class EnglishDefaults(Language.Defaults):
     tag_map = TAG_MAP
     stop_words = STOP_WORDS
     morph_rules = MORPH_RULES
-    is_base_form = en_is_base_form
     syntax_iterators = SYNTAX_ITERATORS
     single_orth_variants = [
         {"tags": ["NFP"], "variants": ["…", "..."]},
@@ -72,6 +36,41 @@ class EnglishDefaults(Language.Defaults):
         {"tags": ["``", "''"], "variants": [('"', '"'), ("“", "”")]},
     ]

+    @classmethod
+    def is_base_form(cls, univ_pos, morphology=None):
+        """
+        Check whether we're dealing with an uninflected paradigm, so we can
+        avoid lemmatization entirely.
+
+        univ_pos (unicode / int): The token's universal part-of-speech tag.
+        morphology (dict): The token's morphological features following the
+            Universal Dependencies scheme.
+        """
+        if morphology is None:
+            morphology = {}
+        if univ_pos == "noun" and morphology.get("Number") == "sing":
+            return True
+        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+            return True
+        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
+        # morphology
+        elif univ_pos == "verb" and (
+            morphology.get("VerbForm") == "fin"
+            and morphology.get("Tense") == "pres"
+            and morphology.get("Number") is None
+        ):
+            return True
+        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+            return True
+        elif morphology.get("VerbForm") == "inf":
+            return True
+        elif morphology.get("VerbForm") == "none":
+            return True
+        elif morphology.get("Degree") == "pos":
+            return True
+        else:
+            return False
+

 class English(Language):
     lang = "en"
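
The base-form check now lives on EnglishDefaults as a classmethod, so the English defaults no longer need a module-level helper (see the related Python 2 pickling caveat in the test change further down). A quick sketch of calling it directly:

from spacy.lang.en import EnglishDefaults

# Singular nouns and infinitives count as uninflected base forms; finite verbs do not.
assert EnglishDefaults.is_base_form("noun", {"Number": "sing"}) is True
assert EnglishDefaults.is_base_form("verb", {"VerbForm": "fin"}) is False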

View File

@@ -45,9 +45,6 @@ class FrenchLemmatizer(Lemmatizer):
             univ_pos = "sconj"
         else:
             return [self.lookup(string)]
-        # See Issue #435 for example of where this logic is requied.
-        if self.is_base_form(univ_pos, morphology):
-            return list(set([string.lower()]))
         index_table = self.lookups.get_table("lemma_index", {})
         exc_table = self.lookups.get_table("lemma_exc", {})
         rules_table = self.lookups.get_table("lemma_rules", {})
@@ -59,43 +56,6 @@ class FrenchLemmatizer(Lemmatizer):
         )
         return lemmas

-    def is_base_form(self, univ_pos, morphology=None):
-        """
-        Check whether we're dealing with an uninflected paradigm, so we can
-        avoid lemmatization entirely.
-        """
-        morphology = {} if morphology is None else morphology
-        others = [
-            key
-            for key in morphology
-            if key not in (POS, "Number", "POS", "VerbForm", "Tense")
-        ]
-        if univ_pos == "noun" and morphology.get("Number") == "sing":
-            return True
-        elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
-            return True
-        # This maps 'VBP' to base form -- probably just need 'IS_BASE'
-        # morphology
-        elif univ_pos == "verb" and (
-            morphology.get("VerbForm") == "fin"
-            and morphology.get("Tense") == "pres"
-            and morphology.get("Number") is None
-            and not others
-        ):
-            return True
-        elif univ_pos == "adj" and morphology.get("Degree") == "pos":
-            return True
-        elif VerbForm_inf in morphology:
-            return True
-        elif VerbForm_none in morphology:
-            return True
-        elif Number_sing in morphology:
-            return True
-        elif Degree_pos in morphology:
-            return True
-        else:
-            return False
-
     def noun(self, string, morphology=None):
         return self(string, "noun", morphology)

View File

@@ -42,7 +42,11 @@ def check_spaces(text, tokens):
 class KoreanTokenizer(DummyTokenizer):
     def __init__(self, cls, nlp=None):
         self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
-        self.Tokenizer = try_mecab_import()
+        MeCab = try_mecab_import()
+        self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+
+    def __del__(self):
+        self.mecab_tokenizer.__del__()

     def __call__(self, text):
         dtokens = list(self.detailed_tokens(text))
@@ -58,17 +62,16 @@ class KoreanTokenizer(DummyTokenizer):
     def detailed_tokens(self, text):
         # 품사 태그(POS)[0], 의미 부류(semantic class)[1], 종성 유무(jongseong)[2], 읽기(reading)[3],
         # 타입(type)[4], 첫번째 품사(start pos)[5], 마지막 품사(end pos)[6], 표현(expression)[7], *
-        with self.Tokenizer("-F%f[0],%f[7]") as tokenizer:
-            for node in tokenizer.parse(text, as_nodes=True):
-                if node.is_eos():
-                    break
-                surface = node.surface
-                feature = node.feature
-                tag, _, expr = feature.partition(",")
-                lemma, _, remainder = expr.partition("/")
-                if lemma == "*":
-                    lemma = surface
-                yield {"surface": surface, "lemma": lemma, "tag": tag}
+        for node in self.mecab_tokenizer.parse(text, as_nodes=True):
+            if node.is_eos():
+                break
+            surface = node.surface
+            feature = node.feature
+            tag, _, expr = feature.partition(",")
+            lemma, _, remainder = expr.partition("/")
+            if lemma == "*":
+                lemma = surface
+            yield {"surface": surface, "lemma": lemma, "tag": tag}

 class KoreanDefaults(Language.Defaults):
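
Holding on to a single MeCab instance avoids re-creating the tokenizer on every call; usage is unchanged. A quick sketch, assuming the optional natto-py and mecab-ko dependencies are installed:

import spacy

nlp = spacy.blank("ko")
doc = nlp("안녕하세요.")
print([(token.text, token.tag_, token.lemma_) for token in doc])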

View File

@@ -21,7 +21,7 @@ class Lemmatizer(object):
     def load(cls, *args, **kwargs):
         raise NotImplementedError(Errors.E172)

-    def __init__(self, lookups, *args, is_base_form=None, **kwargs):
+    def __init__(self, lookups, is_base_form=None, *args, **kwargs):
         """Initialize a Lemmatizer.

         lookups (Lookups): The lookups object containing the (optional) tables
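
Moving is_base_form ahead of *args lets callers pass it positionally as the second argument. A construction sketch; the empty Lookups is only a placeholder:

from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
from spacy.lang.en import EnglishDefaults

lemmatizer = Lemmatizer(Lookups(), EnglishDefaults.is_base_form)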

View File

@@ -49,6 +49,14 @@ def Tok2Vec(width, embed_size, **kwargs):
             >> LN(Maxout(width, width * 5, pieces=3)),
             column=cols.index(ORTH),
         )
+    elif char_embed:
+        embed = concatenate_lists(
+            CharacterEmbed(nM=64, nC=8),
+            FeatureExtracter(cols) >> with_flatten(glove),
+        )
+        reduce_dimensions = LN(
+            Maxout(width, 64 * 8 + width, pieces=cnn_maxout_pieces)
+        )
     else:
         embed = uniqued(
             (glove | norm) >> LN(Maxout(width, width * 2, pieces=3)),
@@ -81,7 +89,8 @@ def Tok2Vec(width, embed_size, **kwargs):
         )
     else:
         tok2vec = FeatureExtracter(cols) >> with_flatten(
-            embed >> convolution ** conv_depth, pad=conv_depth
+            embed
+            >> convolution ** conv_depth, pad=conv_depth
         )

     if bilstm_depth >= 1:
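
The new branch concatenates the CharacterEmbed table with the hashed token features when character embeddings are requested. A construction sketch, assuming char_embed is accepted through Tok2Vec's keyword arguments as elsewhere in spaCy 2.x:

from spacy._ml import Tok2Vec

# Build the tok2vec layer with the character-embedding branch enabled.
tok2vec = Tok2Vec(width=96, embed_size=2000, char_embed=True, conv_depth=4)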

View File

@@ -33,6 +33,7 @@ from .._ml import build_text_classifier, build_simple_cnn_text_classifier
 from .._ml import build_bow_text_classifier, build_nel_encoder
 from .._ml import link_vectors_to_models, zero_init, flatten
 from .._ml import masked_language_model, create_default_optimizer, get_cossim_loss
+from .._ml import MultiSoftmax, get_characters_loss
 from ..errors import Errors, TempErrors, Warnings
 from .. import util
@@ -846,11 +847,15 @@ class MultitaskObjective(Tagger):
 class ClozeMultitask(Pipe):
     @classmethod
     def Model(cls, vocab, tok2vec, **cfg):
-        output_size = vocab.vectors.data.shape[1]
-        output_layer = chain(
-            LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
-            zero_init(Affine(output_size, output_size, drop_factor=0.0))
-        )
+        if cfg["objective"] == "characters":
+            out_sizes = [256] * cfg.get("nr_char", 4)
+            output_layer = MultiSoftmax(out_sizes)
+        else:
+            output_size = vocab.vectors.data.shape[1]
+            output_layer = chain(
+                LayerNorm(Maxout(output_size, tok2vec.nO, pieces=3)),
+                zero_init(Affine(output_size, output_size, drop_factor=0.0))
+            )
         model = chain(tok2vec, output_layer)
         model = masked_language_model(vocab, model)
         model.tok2vec = tok2vec
@@ -861,6 +866,8 @@ class ClozeMultitask(Pipe):
         self.vocab = vocab
         self.model = model
         self.cfg = cfg
+        self.cfg.setdefault("objective", "characters")
+        self.cfg.setdefault("nr_char", 4)

     def set_annotations(self, docs, dep_ids, tensors=None):
         pass
@@ -869,7 +876,8 @@ class ClozeMultitask(Pipe):
                        tok2vec=None, sgd=None, **kwargs):
         link_vectors_to_models(self.vocab)
         if self.model is True:
-            self.model = self.Model(self.vocab, tok2vec)
+            kwargs.update(self.cfg)
+            self.model = self.Model(self.vocab, tok2vec, **kwargs)
         X = self.model.ops.allocate((5, self.model.tok2vec.nO))
         self.model.output_layer.begin_training(X)
         if sgd is None:
@@ -883,13 +891,16 @@ class ClozeMultitask(Pipe):
         return tokvecs, vectors

     def get_loss(self, docs, vectors, prediction):
-        # The simplest way to implement this would be to vstack the
-        # token.vector values, but that's a bit inefficient, especially on GPU.
-        # Instead we fetch the index into the vectors table for each of our tokens,
-        # and look them up all at once. This prevents data copying.
-        ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
-        target = vectors[ids]
-        loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
+        if self.cfg["objective"] == "characters":
+            loss, gradient = get_characters_loss(self.model.ops, docs, prediction)
+        else:
+            # The simplest way to implement this would be to vstack the
+            # token.vector values, but that's a bit inefficient, especially on GPU.
+            # Instead we fetch the index into the vectors table for each of our tokens,
+            # and look them up all at once. This prevents data copying.
+            ids = self.model.ops.flatten([doc.to_array(ID).ravel() for doc in docs])
+            target = vectors[ids]
+            loss, gradient = get_cossim_loss(prediction, target, ignore_zeros=True)
         return float(loss), gradient

     def update(self, docs, golds, drop=0., sgd=None, losses=None):
@@ -906,6 +917,20 @@ class ClozeMultitask(Pipe):
         if losses is not None:
             losses[self.name] += loss

+    @staticmethod
+    def decode_utf8_predictions(char_array):
+        # The format alternates filling from start and end, and 255 is missing
+        words = []
+        char_array = char_array.reshape((char_array.shape[0], -1, 256))
+        nr_char = char_array.shape[1]
+        char_array = char_array.argmax(axis=-1)
+        for row in char_array:
+            starts = [chr(c) for c in row[::2] if c != 255]
+            ends = [chr(c) for c in row[1::2] if c != 255]
+            word = "".join(starts + list(reversed(ends)))
+            words.append(word)
+        return words
+

 @component("textcat", assigns=["doc.cats"])
 class TextCategorizer(Pipe):
@@ -1069,6 +1094,7 @@ cdef class DependencyParser(Parser):
     assigns = ["token.dep", "token.is_sent_start", "doc.sents"]
     requires = []
     TransitionSystem = ArcEager
+    nr_feature = 8

     @property
     def postprocesses(self):
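
The decode_utf8_predictions helper added above assumes the alternating layout produced by Doc.to_utf8_array: even positions hold bytes from the start of the word, odd positions hold bytes from the end (reversed), and 255 marks padding. A toy illustration of that decoding step:

import numpy

row = numpy.array([ord("s"), ord("y"), ord("p"), ord("c"), ord("a"), 255])
starts = [chr(c) for c in row[::2] if c != 255]
ends = [chr(c) for c in row[1::2] if c != 255]
print("".join(starts + list(reversed(ends))))  # -> "spacy"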

View File

@@ -59,7 +59,7 @@ def test_issue2626_2835(en_tokenizer, text):
 def test_issue2656(en_tokenizer):
-    """Test that tokenizer correctly splits of punctuation after numbers with
+    """Test that tokenizer correctly splits off punctuation after numbers with
     decimal points.
     """
     doc = en_tokenizer("I went for 40.3, and got home by 10.0.")

View File

@@ -121,6 +121,7 @@ def test_issue3248_1():
     assert len(matcher) == 2

+@pytest.mark.skipif(is_python2, reason="Can't pickle instancemethod for is_base_form")
 def test_issue3248_2():
     """Test that the PhraseMatcher can be pickled correctly."""
     nlp = English()
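
The new skip marker reflects that the bound is_base_form method can't be pickled on Python 2; on Python 3 the PhraseMatcher (and the English defaults it references) pickles fine. A rough sketch of what the test exercises:

import pickle

from spacy.lang.en import English
from spacy.matcher import PhraseMatcher

nlp = English()
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TEST", None, nlp("spaCy"))  # v2-style add(key, on_match, *docs)
restored = pickle.loads(pickle.dumps(matcher))
assert len(restored) == len(matcher)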

View File

@@ -473,7 +473,7 @@ $ python -m spacy pretrain [texts_loc] [vectors_model] [output_dir]
 | `--use-chars`, `-chr` <Tag variant="new">2.2.2</Tag> | flag | Whether to use character-based embedding. |
 | `--sa-depth`, `-sa` <Tag variant="new">2.2.2</Tag> | option | Depth of self-attention layers. |
 | `--embed-rows`, `-er` | option | Number of embedding rows. |
-| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"L2"` or `"cosine"`. |
+| `--loss-func`, `-L` | option | Loss function to use for the objective. Either `"cosine"`, `"L2"` or `"characters"`. |
 | `--dropout`, `-d` | option | Dropout rate. |
 | `--batch-size`, `-bs` | option | Number of words per training batch. |
 | `--max-length`, `-xw` | option | Maximum words per example. Longer examples are discarded. |

View File

@@ -1,5 +1,58 @@
{
"resources": [
{
"id": "spacy-streamlit",
"title": "spacy-streamlit",
"slogan": "spaCy building blocks for Streamlit apps",
"github": "explosion/spacy-streamlit",
"description": "This package contains utilities for visualizing spaCy models and building interactive spaCy-powered apps with [Streamlit](https://streamlit.io). It includes various building blocks you can use in your own Streamlit app, like visualizers for **syntactic dependencies**, **named entities**, **text classification**, **semantic similarity** via word vectors, token attributes, and more.",
"pip": "spacy-streamlit",
"category": ["visualizers"],
"thumb": "https://i.imgur.com/mhEjluE.jpg",
"image": "https://user-images.githubusercontent.com/13643239/85388081-f2da8700-b545-11ea-9bd4-e303d3c5763c.png",
"code_example": [
"import spacy_streamlit",
"",
"models = [\"en_core_web_sm\", \"en_core_web_md\"]",
"default_text = \"Sundar Pichai is the CEO of Google.\"",
"spacy_streamlit.visualize(models, default_text))"
],
"author": "Ines Montani",
"author_links": {
"twitter": "_inesmontani",
"github": "ines",
"website": "https://ines.io"
}
},
{
"id": "spaczz",
"title": "spaczz",
"slogan": "Fuzzy matching and more for spaCy.",
"description": "Spaczz provides fuzzy matching and multi-token regex matching functionality for spaCy. Spaczz's components have similar APIs to their spaCy counterparts and spaczz pipeline components can integrate into spaCy pipelines where they can be saved/loaded as models.",
"github": "gandersen101/spaczz",
"pip": "spaczz",
"code_example": [
"import spacy",
"from spaczz.pipeline import SpaczzRuler",
"",
"nlp = spacy.blank('en')",
"ruler = SpaczzRuler(nlp)",
"ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
"nlp.add_pipe(ruler)",
"",
"doc = nlp('Oops, I spelled Bill Gatez wrong.')",
"print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
"author": "Grant Andersen",
"author_links": {
"twitter": "gandersen101",
"github": "gandersen101"
},
"category": ["pipeline"],
"tags": ["fuzzy-matching", "regex"]
},
{
"id": "spacy-universal-sentence-encoder",
"title": "SpaCy - Universal Sentence Encoder",
@@ -1237,6 +1290,19 @@
"youtube": "K1elwpgDdls",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-course-es",
"title": "NLP avanzado con spaCy · Un curso en línea gratis",
"description": "spaCy es un paquete moderno de Python para hacer Procesamiento de Lenguaje Natural de potencia industrial. En este curso en línea, interactivo y gratuito, aprenderás a usar spaCy para construir sistemas avanzados de comprensión de lenguaje natural usando enfoques basados en reglas y en machine learning.",
"url": "https://course.spacy.io/es",
"author": "Camila Gutiérrez",
"author_links": {
"twitter": "Mariacamilagl30"
},
"youtube": "RNiLVCE5d4k",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-1",
@@ -1293,6 +1359,20 @@
"youtube": "IqOJU1-_Fi0",
"category": ["videos"]
},
{
"type": "education",
"id": "video-intro-to-nlp-episode-5",
"title": "Intro to NLP with spaCy (5)",
"slogan": "Episode 5: Rules vs. Machine Learning",
"description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recogntion model from scratch.",
"author": "Vincent Warmerdam",
"author_links": {
"twitter": "fishnets88",
"github": "koaning"
},
"youtube": "f4sqeLRzkPg",
"category": ["videos"]
},
{
"type": "education",
"id": "video-spacy-irl-entity-linking",
@@ -2347,6 +2427,32 @@
},
"category": ["pipeline", "conversational", "research"],
"tags": ["spell check", "correction", "preprocessing", "translation", "correction"]
},
{
"id": "texthero",
"title": "Texthero",
"slogan": "Text preprocessing, representation and visualization from zero to hero.",
"description": "Texthero is a python package to work with text data efficiently. It empowers NLP developers with a tool to quickly understand any text-based dataset and it provides a solid pipeline to clean and represent text data, from zero to hero.",
"github": "jbesomi/texthero",
"pip": "texthero",
"code_example": [
"import texthero as hero",
"import pandas as pd",
"",
"df = pd.read_csv('https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv')",
"df['named_entities'] = hero.named_entities(df['text'])",
"df.head()"
],
"code_language": "python",
"url": "https://texthero.org",
"thumb": "https://texthero.org/img/T.png",
"image": "https://texthero.org/docs/assets/texthero.png",
"author": "Jonathan Besomi",
"author_links": {
"github": "jbesomi",
"website": "https://besomi.ai"
},
"category": ["standalone"]
}
],