Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-01 04:46:38 +03:00)

Merge branch 'master' into spacy.io

This commit is contained in: commit 71f5a5daa1

.github/contributors/prilopes.md (new file, 106 lines, vendored)

@@ -0,0 +1,106 @@

# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;

* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;

* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;

* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and

* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and

* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;

* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and

* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:

* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.

* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry           |
| ------------------------------ | --------------- |
| Name                           | Priscilla Lopes |
| Company name (if applicable)   |                 |
| Title or role (if applicable)  |                 |
| Date                           | 2019-11-06      |
| GitHub username                | prilopes        |
| Website (optional)             |                 |

@@ -221,6 +221,13 @@ def evaluate(nlp, text_loc, gold_loc, sys_loc, limit=None):


def write_conllu(docs, file_):
    if not Token.has_extension("get_conllu_lines"):
        Token.set_extension("get_conllu_lines", method=get_token_conllu)
    if not Token.has_extension("begins_fused"):
        Token.set_extension("begins_fused", default=False)
    if not Token.has_extension("inside_fused"):
        Token.set_extension("inside_fused", default=False)

    merger = Matcher(docs[0].vocab)
    merger.add("SUBTOK", None, [{"DEP": "subtok", "op": "+"}])
    for i, doc in enumerate(docs):

@@ -6,12 +6,12 @@ blis>=0.4.0,<0.5.0
murmurhash>=0.28.0,<1.1.0
wasabi>=0.4.0,<1.1.0
srsly>=0.1.0,<1.1.0
catalogue>=0.0.7,<1.1.0
# Third party dependencies
numpy>=1.15.0
requests>=2.13.0,<3.0.0
plac>=0.9.6,<1.2.0
pathlib==1.0.1; python_version < "3.4"
importlib_metadata>=0.20; python_version < "3.8"
# Optional dependencies
jsonschema>=2.6.0,<3.1.0
# Development dependencies

@@ -48,13 +48,13 @@ install_requires =
    blis>=0.4.0,<0.5.0
    wasabi>=0.4.0,<1.1.0
    srsly>=0.1.0,<1.1.0
    catalogue>=0.0.7,<1.1.0
    # Third-party dependencies
    setuptools
    numpy>=1.15.0
    plac>=0.9.6,<1.2.0
    requests>=2.13.0,<3.0.0
    pathlib==1.0.1; python_version < "3.4"
    importlib_metadata>=0.20; python_version < "3.8"

[options.extras_require]
lookups =

@@ -15,7 +15,7 @@ from .glossary import explain
from .about import __version__
from .errors import Errors, Warnings, deprecation_warning
from . import util
from .util import register_architecture, get_architecture
from .util import registry
from .language import component

@@ -36,11 +36,6 @@ try:
except ImportError:
    cupy = None

try:  # Python 3.8
    import importlib.metadata as importlib_metadata
except ImportError:
    import importlib_metadata  # noqa: F401

try:
    from thinc.neural.optimizers import Optimizer  # noqa: F401
except ImportError:

@@ -5,7 +5,7 @@ import uuid

from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_ARCS, TPL_ENTS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from ..util import minify_html, escape_html, get_entry_points, ENTRY_POINTS
from ..util import minify_html, escape_html, registry
from ..errors import Errors

@@ -242,7 +242,7 @@ class EntityRenderer(object):
            "CARDINAL": "#e4e7d2",
            "PERCENT": "#e4e7d2",
        }
        user_colors = get_entry_points(ENTRY_POINTS.displacy_colors)
        user_colors = registry.displacy_colors.get_all()
        for user_color in user_colors.values():
            colors.update(user_color)
        colors.update(options.get("colors", {}))

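With the entry-point lookup replaced by registry.displacy_colors above, a colour scheme can also be registered directly in code. A minimal sketch, assuming the catalogue-based registry from spacy.util; the label "FRUIT" and the registry name "my_colors" are hypothetical and not part of this diff:

    from spacy.util import registry

    # each registered value is a dict of entity label -> colour; the renderer
    # merges every dict returned by registry.displacy_colors.get_all()
    my_colors = {"FRUIT": "#ff6961"}
    registry.displacy_colors.register("my_colors", func=my_colors)

Packages can still expose the same dict through the spacy_displacy_colors entry-point group, which the catalogue registry picks up automatically.
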
@@ -11,12 +11,12 @@ Example sentences to test spaCy and its language models.


sentences = [
    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares",
    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes",
    "San Francisco analiza prohibir los robots delivery",
    "Londres es una gran ciudad del Reino Unido",
    "El gato come pescado",
    "Veo al hombre con el telescopio",
    "La araña come moscas",
    "El pingüino incuba en su nido",
    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
    "San Francisco analiza prohibir los robots delivery.",
    "Londres es una gran ciudad del Reino Unido.",
    "El gato come pescado.",
    "Veo al hombre con el telescopio.",
    "La araña come moscas.",
    "El pingüino incuba en su nido.",
]

@@ -11,8 +11,8 @@ Example sentences to test spaCy and its language models.


sentences = [
    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar",
    "Selvkjørende biler flytter forsikringsansvaret over på produsentene ",
    "San Francisco vurderer å forby robotbud på fortauene",
    "Apple vurderer å kjøpe britisk oppstartfirma for en milliard dollar.",
    "Selvkjørende biler flytter forsikringsansvaret over på produsentene.",
    "San Francisco vurderer å forby robotbud på fortauene.",
    "London er en stor by i Storbritannia.",
]

spacy/lang/xx/examples.py (new file, 99 lines)

@@ -0,0 +1,99 @@
# coding: utf8
from __future__ import unicode_literals


"""
Example sentences to test spaCy and its language models.

>>> from spacy.lang.xx.examples import sentences
>>> docs = nlp.pipe(sentences)
"""

# combined examples from de/en/es/fr/it/nl/pl/pt/ru

sentences = [
    "Die ganze Stadt ist ein Startup: Shenzhen ist das Silicon Valley für Hardware-Firmen",
    "Wie deutsche Startups die Technologie vorantreiben wollen: Künstliche Intelligenz",
    "Trend zum Urlaub in Deutschland beschert Gastwirten mehr Umsatz",
    "Bundesanwaltschaft erhebt Anklage gegen mutmaßlichen Schweizer Spion",
    "San Francisco erwägt Verbot von Lieferrobotern",
    "Autonome Fahrzeuge verlagern Haftpflicht auf Hersteller",
    "Wo bist du?",
    "Was ist die Hauptstadt von Deutschland?",
    "Apple is looking at buying U.K. startup for $1 billion",
    "Autonomous cars shift insurance liability toward manufacturers",
    "San Francisco considers banning sidewalk delivery robots",
    "London is a big city in the United Kingdom.",
    "Where are you?",
    "Who is the president of France?",
    "What is the capital of the United States?",
    "When was Barack Obama born?",
    "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
    "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
    "San Francisco analiza prohibir los robots delivery.",
    "Londres es una gran ciudad del Reino Unido.",
    "El gato come pescado.",
    "Veo al hombre con el telescopio.",
    "La araña come moscas.",
    "El pingüino incuba en su nido.",
    "Apple cherche à acheter une start-up anglaise pour 1 milliard de dollars",
    "Les voitures autonomes déplacent la responsabilité de l'assurance vers les constructeurs",
    "San Francisco envisage d'interdire les robots coursiers sur les trottoirs",
    "Londres est une grande ville du Royaume-Uni",
    "L’Italie choisit ArcelorMittal pour reprendre la plus grande aciérie d’Europe",
    "Apple lance HomePod parce qu'il se sent menacé par l'Echo d'Amazon",
    "La France ne devrait pas manquer d'électricité cet été, même en cas de canicule",
    "Nouvelles attaques de Trump contre le maire de Londres",
    "Où es-tu ?",
    "Qui est le président de la France ?",
    "Où est la capitale des États-Unis ?",
    "Quand est né Barack Obama ?",
    "Apple vuole comprare una startup del Regno Unito per un miliardo di dollari",
    "Le automobili a guida autonoma spostano la responsabilità assicurativa verso i produttori",
    "San Francisco prevede di bandire i robot di consegna porta a porta",
    "Londra è una grande città del Regno Unito.",
    "Apple overweegt om voor 1 miljard een U.K. startup te kopen",
    "Autonome auto's verschuiven de verzekeringverantwoordelijkheid naar producenten",
    "San Francisco overweegt robots op voetpaden te verbieden",
    "Londen is een grote stad in het Verenigd Koninkrijk",
    "Poczuł przyjemną woń mocnej kawy.",
    "Istnieje wiele dróg oddziaływania substancji psychoaktywnej na układ nerwowy.",
    "Powitał mnie biało-czarny kot, płosząc siedzące na płocie trzy dorodne dudki.",
    "Nowy abonament pod lupą Komisji Europejskiej",
    "Czy w ciągu ostatnich 48 godzin spożyłeś leki zawierające paracetamol?",
    "Kto ma ochotę zapoznać się z innymi niż w książkach przygodami Muminków i ich przyjaciół, temu polecam komiks Tove Jansson „Muminki i morze”.",
    "Apple está querendo comprar uma startup do Reino Unido por 100 milhões de dólares.",
    "Carros autônomos empurram a responsabilidade do seguro para os fabricantes.",
    "São Francisco considera banir os robôs de entrega que andam pelas calçadas.",
    "Londres é a maior cidade do Reino Unido.",
    # Translations from English:
    "Apple рассматривает возможность покупки стартапа из Соединённого Королевства за $1 млрд",
    "Беспилотные автомобили перекладывают страховую ответственность на производителя",
    "В Сан-Франциско рассматривается возможность запрета роботов-курьеров, которые перемещаются по тротуару",
    "Лондон — это большой город в Соединённом Королевстве",
    # Native Russian sentences:
    # Colloquial:
    "Да, нет, наверное!",  # Typical polite refusal
    "Обратите внимание на необыкновенную красоту этого города-героя Москвы, столицы нашей Родины!",  # From a tour guide speech
    # Examples of Bookish Russian:
    # Quote from "The Golden Calf"
    "Рио-де-Жанейро — это моя мечта, и не смейте касаться её своими грязными лапами!",
    # Quotes from "Ivan Vasilievich changes his occupation"
    "Ты пошто боярыню обидел, смерд?!!",
    "Оставь меня, старушка, я в печали!",
    # Quotes from Dostoevsky:
    "Уж коли я, такой же, как и ты, человек грешный, над тобой умилился и пожалел тебя, кольми паче бог",
    "В мечтах я нередко, говорит, доходил до страстных помыслов о служении человечеству и может быть действительно пошел бы на крест за людей, если б это вдруг как-нибудь потребовалось, а между тем я двух дней не в состоянии прожить ни с кем в одной комнате, о чем знаю из опыта",
    "Зато всегда так происходило, что чем более я ненавидел людей в частности, тем пламеннее становилась любовь моя к человечеству вообще",
    # Quotes from Chekhov:
    "Ненужные дела и разговоры всё об одном отхватывают на свою долю лучшую часть времени, лучшие силы, и в конце концов остается какая-то куцая, бескрылая жизнь, какая-то чепуха, и уйти и бежать нельзя, точно сидишь в сумасшедшем доме или в арестантских ротах!",
    # Quotes from Turgenev:
    "Нравится тебе женщина, старайся добиться толку; а нельзя — ну, не надо, отвернись — земля не клином сошлась",
    "Узенькое местечко, которое я занимаю, до того крохотно в сравнении с остальным пространством, где меня нет и где дела до меня нет; и часть времени, которую мне удастся прожить, так ничтожна перед вечностью, где меня не было и не будет...",
    # Quotes from newspapers:
    # Komsomolskaya Pravda:
    "На заседании президиума правительства Москвы принято решение присвоить статус инвестиционного приоритетного проекта города Москвы киностудии Союзмультфильм",
    "Глава Минобороны Сергей Шойгу заявил, что обстановка на этом стратегическом направлении требует непрерывного совершенствования боевого состава войск",
    # Argumenty i Facty:
    "На реплику лже-Говина — дескать, он (Волков) будет лучшим революционером — Стамп с энтузиазмом ответил: Непременно!",
]

@@ -4,19 +4,92 @@ from __future__ import unicode_literals
from ...attrs import LANG
from ...language import Language
from ...tokens import Doc
from ...util import DummyTokenizer
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP


def try_jieba_import(use_jieba):
    try:
        import jieba
        return jieba
    except ImportError:
        if use_jieba:
            msg = (
                "Jieba not installed. Either set Chinese.use_jieba = False, "
                "or install it https://github.com/fxsjy/jieba"
            )
            raise ImportError(msg)


class ChineseTokenizer(DummyTokenizer):
    def __init__(self, cls, nlp=None):
        self.vocab = nlp.vocab if nlp is not None else cls.create_vocab(nlp)
        self.use_jieba = cls.use_jieba
        self.jieba_seg = try_jieba_import(self.use_jieba)
        self.tokenizer = Language.Defaults().create_tokenizer(nlp)

    def __call__(self, text):
        # use jieba
        if self.use_jieba:
            jieba_words = list([x for x in self.jieba_seg.cut(text, cut_all=False) if x])
            words = [jieba_words[0]]
            spaces = [False]
            for i in range(1, len(jieba_words)):
                word = jieba_words[i]
                if word.isspace():
                    # second token in adjacent whitespace following a
                    # non-space token
                    if spaces[-1]:
                        words.append(word)
                        spaces.append(False)
                    # first space token following non-space token
                    elif word == " " and not words[-1].isspace():
                        spaces[-1] = True
                    # token is non-space whitespace or any whitespace following
                    # a whitespace token
                    else:
                        # extend previous whitespace token with more whitespace
                        if words[-1].isspace():
                            words[-1] += word
                        # otherwise it's a new whitespace token
                        else:
                            words.append(word)
                            spaces.append(False)
                else:
                    words.append(word)
                    spaces.append(False)
            return Doc(self.vocab, words=words, spaces=spaces)

        # split into individual characters
        words = []
        spaces = []
        for token in self.tokenizer(text):
            if token.text.isspace():
                words.append(token.text)
                spaces.append(False)
            else:
                words.extend(list(token.text))
                spaces.extend([False] * len(token.text))
                spaces[-1] = bool(token.whitespace_)
        return Doc(self.vocab, words=words, spaces=spaces)


class ChineseDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "zh"
    use_jieba = True
    tokenizer_exceptions = BASE_EXCEPTIONS
    stop_words = STOP_WORDS
    tag_map = TAG_MAP
    writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
    use_jieba = True

    @classmethod
    def create_tokenizer(cls, nlp=None):
        return ChineseTokenizer(cls, nlp)


class Chinese(Language):

@@ -24,26 +97,7 @@ class Chinese(Language):
    Defaults = ChineseDefaults  # override defaults

    def make_doc(self, text):
        if self.Defaults.use_jieba:
            try:
                import jieba
            except ImportError:
                msg = (
                    "Jieba not installed. Either set Chinese.use_jieba = False, "
                    "or install it https://github.com/fxsjy/jieba"
                )
                raise ImportError(msg)
            words = list(jieba.cut(text, cut_all=False))
            words = [x for x in words if x]
            return Doc(self.vocab, words=words, spaces=[False] * len(words))
        else:
            words = []
            spaces = []
            for token in self.tokenizer(text):
                words.extend(list(token.text))
                spaces.extend([False] * len(token.text))
                spaces[-1] = bool(token.whitespace_)
            return Doc(self.vocab, words=words, spaces=spaces)
        return self.tokenizer(text)


__all__ = ["Chinese"]

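The net effect of the zh changes above: the jieba logic moves out of Chinese.make_doc into a dedicated ChineseTokenizer, and whitespace between jieba segments is preserved instead of dropped. A minimal usage sketch (assuming jieba is installed; the example sentence is arbitrary and not part of this diff):

    from spacy.lang.zh import Chinese

    nlp = Chinese()  # ChineseDefaults.create_tokenizer() builds a ChineseTokenizer
    doc = nlp("苹果公司正在考虑收购一家英国初创公司")
    print([token.text for token in doc])

    # character-based segmentation instead of jieba: flip the flag on the
    # tokenizer (as the new test_tokenizer.py below does), or on the Defaults
    # before constructing the pipeline
    nlp.tokenizer.use_jieba = False

Without jieba installed and with use_jieba left at True, try_jieba_import raises the ImportError shown above.
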
@@ -1,11 +1,12 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import POS, PUNCT, ADJ, CONJ, CCONJ, NUM, DET, ADV, ADP, X, VERB
from ...symbols import NOUN, PART, INTJ, PRON
from ...symbols import POS, PUNCT, ADJ, SCONJ, CCONJ, NUM, DET, ADV, ADP, X
from ...symbols import NOUN, PART, INTJ, PRON, VERB, SPACE

# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set.
# We also map the tags to the simpler Google Universal POS tag set.
# The Chinese part-of-speech tagger uses the OntoNotes 5 version of the Penn
# Treebank tag set. We also map the tags to the simpler Universal Dependencies
# v2 tag set.

TAG_MAP = {
    "AS": {POS: PART},

@@ -38,10 +39,11 @@ TAG_MAP = {
    "OD": {POS: NUM},
    "DT": {POS: DET},
    "CC": {POS: CCONJ},
    "CS": {POS: CONJ},
    "CS": {POS: SCONJ},
    "AD": {POS: ADV},
    "JJ": {POS: ADJ},
    "P": {POS: ADP},
    "PN": {POS: PRON},
    "PU": {POS: PUNCT},
    "_SP": {POS: SPACE},
}

@@ -51,8 +51,8 @@ class BaseDefaults(object):
        filenames = {name: root / filename for name, filename in cls.resources}
        if LANG in cls.lex_attr_getters:
            lang = cls.lex_attr_getters[LANG](None)
            user_lookups = util.get_entry_point(util.ENTRY_POINTS.lookups, lang, {})
            filenames.update(user_lookups)
            if lang in util.registry.lookups:
                filenames.update(util.registry.lookups.get(lang))
        lookups = Lookups()
        for name, filename in filenames.items():
            data = util.load_language_data(filename)

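In other words, per-language lookup tables are now resolved through registry.lookups rather than the old spacy_lookups entry-point helper. A hedged sketch of what a registration could look like, based only on how the value is consumed above (a dict of table name to filename); the table name and path are hypothetical:

    from spacy.util import registry

    # filenames registered for a language code are merged into the table set
    # built here and loaded with util.load_language_data()
    registry.lookups.register("en", func={"lemma_lookup": "/path/to/lemma_lookup.json"})

Lookup-data packages can expose the same mapping via the spacy_lookups entry-point group instead of calling register() at runtime.
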
@@ -155,7 +155,7 @@ class Language(object):
            100,000 characters in one text.
        RETURNS (Language): The newly constructed object.
        """
        user_factories = util.get_entry_points(util.ENTRY_POINTS.factories)
        user_factories = util.registry.factories.get_all()
        self.factories.update(user_factories)
        self._meta = dict(meta)
        self._path = None

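The same pattern applies to pipeline component factories: everything in registry.factories is merged into Language.factories when an nlp object is constructed. A hedged sketch with a hypothetical component name; register before creating the pipeline, either in code as below or through the spacy_factories entry-point group:

    from spacy.util import registry

    def create_my_component(nlp, **cfg):
        # factory signature follows Language.factories: the nlp object plus config kwargs
        def my_component(doc):
            return doc
        return my_component

    registry.factories.register("my_component", func=create_my_component)
    # later, after e.g. nlp = spacy.blank("en"):
    #     nlp.add_pipe(nlp.create_pipe("my_component"))
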
@@ -240,7 +240,7 @@ cdef class DependencyMatcher:
        for i, (ent_id, nodes) in enumerate(matched_key_trees):
            on_match = self._callbacks.get(ent_id)
            if on_match is not None:
                on_match(self, doc, i, matches)
                on_match(self, doc, i, matched_key_trees)
        return matched_key_trees

    def recurse(self,tree,id_to_position,_node_operator_map,int patternLength,visitedNodes,matched_trees):

@@ -3,10 +3,10 @@ from __future__ import unicode_literals
from thinc.api import chain
from thinc.v2v import Maxout
from thinc.misc import LayerNorm
from ..util import register_architecture, make_layer
from ..util import registry, make_layer


@register_architecture("thinc.FeedForward.v1")
@registry.architectures.register("thinc.FeedForward.v1")
def FeedForward(config):
    layers = [make_layer(layer_cfg) for layer_cfg in config["layers"]]
    model = chain(*layers)

@@ -14,7 +14,7 @@ def FeedForward(config):
    return model


@register_architecture("spacy.LayerNormalizedMaxout.v1")
@registry.architectures.register("spacy.LayerNormalizedMaxout.v1")
def LayerNormalizedMaxout(config):
    width = config["width"]
    pieces = config["pieces"]

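These architecture functions are looked up by name and called with their "config" block via util.make_layer (see the util.py hunk further down). A minimal sketch of the config format, inferred from the registrations above; the width/pieces values are arbitrary:

    from spacy.util import make_layer

    # "arch" selects a function from registry.architectures,
    # "config" is passed to it as the single argument; the module defining the
    # architecture must have been imported so the decorator has run
    cfg = {
        "arch": "spacy.LayerNormalizedMaxout.v1",
        "config": {"width": 96, "pieces": 3},
    }
    layer = make_layer(cfg)
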
@@ -6,11 +6,11 @@ from thinc.v2v import Maxout, Model
from thinc.i2v import HashEmbed, StaticVectors
from thinc.t2t import ExtractWindow
from thinc.misc import Residual, LayerNorm, FeatureExtracter
from ..util import make_layer, register_architecture
from ..util import make_layer, registry
from ._wire import concatenate_lists


@register_architecture("spacy.Tok2Vec.v1")
@registry.architectures.register("spacy.Tok2Vec.v1")
def Tok2Vec(config):
    doc2feats = make_layer(config["@doc2feats"])
    embed = make_layer(config["@embed"])

@@ -24,13 +24,13 @@ def Tok2Vec(config):
    return tok2vec


@register_architecture("spacy.Doc2Feats.v1")
@registry.architectures.register("spacy.Doc2Feats.v1")
def Doc2Feats(config):
    columns = config["columns"]
    return FeatureExtracter(columns)


@register_architecture("spacy.MultiHashEmbed.v1")
@registry.architectures.register("spacy.MultiHashEmbed.v1")
def MultiHashEmbed(config):
    # For backwards compatibility with models before the architecture registry,
    # we have to be careful to get exactly the same model structure. One subtle

@@ -78,7 +78,7 @@ def MultiHashEmbed(config):
    return layer


@register_architecture("spacy.CharacterEmbed.v1")
@registry.architectures.register("spacy.CharacterEmbed.v1")
def CharacterEmbed(config):
    from .. import _ml

@@ -94,7 +94,7 @@ def CharacterEmbed(config):
    return model


@register_architecture("spacy.MaxoutWindowEncoder.v1")
@registry.architectures.register("spacy.MaxoutWindowEncoder.v1")
def MaxoutWindowEncoder(config):
    nO = config["width"]
    nW = config["window_size"]

@@ -110,7 +110,7 @@ def MaxoutWindowEncoder(config):
    return model


@register_architecture("spacy.MishWindowEncoder.v1")
@registry.architectures.register("spacy.MishWindowEncoder.v1")
def MishWindowEncoder(config):
    from thinc.v2v import Mish

@@ -124,12 +124,12 @@ def MishWindowEncoder(config):
    return model


@register_architecture("spacy.PretrainedVectors.v1")
@registry.architectures.register("spacy.PretrainedVectors.v1")
def PretrainedVectors(config):
    return StaticVectors(config["vectors_name"], config["width"], config["column"])


@register_architecture("spacy.TorchBiLSTMEncoder.v1")
@registry.architectures.register("spacy.TorchBiLSTMEncoder.v1")
def TorchBiLSTMEncoder(config):
    import torch.nn
    from thinc.extra.wrappers import PyTorchWrapperRNN

@@ -218,3 +218,9 @@ def uk_tokenizer():
@pytest.fixture(scope="session")
def ur_tokenizer():
    return get_lang_class("ur").Defaults.create_tokenizer()


@pytest.fixture(scope="session")
def zh_tokenizer():
    pytest.importorskip("jieba")
    return get_lang_class("zh").Defaults.create_tokenizer()

@@ -183,3 +183,18 @@ def test_doc_retokenizer_split_lex_attrs(en_vocab):
        retokenizer.split(doc[0], ["Los", "Angeles"], heads, attrs=attrs)
    assert doc[0].is_stop
    assert not doc[1].is_stop


def test_doc_retokenizer_realloc(en_vocab):
    """#4604: realloc correctly when new tokens outnumber original tokens"""
    text = "Hyperglycemic adverse events following antipsychotic drug administration in the"
    doc = Doc(en_vocab, words=text.split()[:-1])
    with doc.retokenize() as retokenizer:
        token = doc[0]
        heads = [(token, 0)] * len(token)
        retokenizer.split(doc[token.i], list(token.text), heads=heads)
    doc = Doc(en_vocab, words=text.split())
    with doc.retokenize() as retokenizer:
        token = doc[0]
        heads = [(token, 0)] * len(token)
        retokenizer.split(doc[token.i], list(token.text), heads=heads)

spacy/tests/lang/zh/__init__.py (new file, empty)

spacy/tests/lang/zh/test_text.py (new file, 25 lines)

@@ -0,0 +1,25 @@
# coding: utf-8
from __future__ import unicode_literals


import pytest


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("999.0", True),
        ("一", True),
        ("二", True),
        ("〇", True),
        ("十一", True),
        ("狗", False),
        (",", False),
    ],
)
def test_lex_attrs_like_number(zh_tokenizer, text, match):
    tokens = zh_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match

spacy/tests/lang/zh/test_tokenizer.py (new file, 31 lines)

@@ -0,0 +1,31 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest


# fmt: off
TOKENIZER_TESTS = [
    ("作为语言而言,为世界使用人数最多的语言,目前世界有五分之一人口做为母语。",
     ['作为', '语言', '而言', ',', '为', '世界', '使用', '人', '数最多',
      '的', '语言', ',', '目前', '世界', '有', '五分之一', '人口', '做',
      '为', '母语', '。']),
]
# fmt: on


@pytest.mark.parametrize("text,expected_tokens", TOKENIZER_TESTS)
def test_zh_tokenizer(zh_tokenizer, text, expected_tokens):
    zh_tokenizer.use_jieba = False
    tokens = [token.text for token in zh_tokenizer(text)]
    assert tokens == list(text)

    zh_tokenizer.use_jieba = True
    tokens = [token.text for token in zh_tokenizer(text)]
    assert tokens == expected_tokens


def test_extra_spaces(zh_tokenizer):
    # note: three spaces after "I"
    tokens = zh_tokenizer("I   like cheese.")
    assert tokens[1].orth_ == "  "

spacy/tests/regression/test_issue4590.py (new file, 34 lines)

@@ -0,0 +1,34 @@
# coding: utf-8
from __future__ import unicode_literals

import pytest
from mock import Mock
from spacy.matcher import DependencyMatcher
from ..util import get_doc


def test_issue4590(en_vocab):
    """Test that matches param in on_match method are the same as matches run with no on_match method"""
    pattern = [
        {"SPEC": {"NODE_NAME": "jumped"}, "PATTERN": {"ORTH": "jumped"}},
        {"SPEC": {"NODE_NAME": "fox", "NBOR_RELOP": ">", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
        {"SPEC": {"NODE_NAME": "quick", "NBOR_RELOP": ".", "NBOR_NAME": "jumped"}, "PATTERN": {"ORTH": "fox"}},
    ]

    on_match = Mock()

    matcher = DependencyMatcher(en_vocab)
    matcher.add("pattern", on_match, pattern)

    text = "The quick brown fox jumped over the lazy fox"
    heads = [3, 2, 1, 1, 0, -1, 2, 1, -3]
    deps = ["det", "amod", "amod", "nsubj", "prep", "pobj", "det", "amod"]

    doc = get_doc(en_vocab, text.split(), heads=heads, deps=deps)

    matches = matcher(doc)

    on_match_args = on_match.call_args

    assert on_match_args[0][3] == matches

spacy/tests/test_architectures.py (new file, 19 lines)

@@ -0,0 +1,19 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy import registry
from thinc.v2v import Affine
from catalogue import RegistryError


@registry.architectures.register("my_test_function")
def create_model(nr_in, nr_out):
    return Affine(nr_in, nr_out)


def test_get_architecture():
    arch = registry.architectures.get("my_test_function")
    assert arch is create_model
    with pytest.raises(RegistryError):
        registry.architectures.get("not_an_existing_key")

(deleted file)

@@ -1,19 +0,0 @@
# coding: utf8
from __future__ import unicode_literals

import pytest
from spacy import register_architecture
from spacy import get_architecture
from thinc.v2v import Affine


@register_architecture("my_test_function")
def create_model(nr_in, nr_out):
    return Affine(nr_in, nr_out)


def test_get_architecture():
    arch = get_architecture("my_test_function")
    assert arch is create_model
    with pytest.raises(KeyError):
        get_architecture("not_an_existing_key")

@@ -329,7 +329,7 @@ def _split(Doc doc, int token_index, orths, heads, attrs):
        doc.c[i].head += offset
    # Double doc.c max_length if necessary (until big enough for all new tokens)
    while doc.length + nb_subtokens - 1 >= doc.max_length:
        doc._realloc(doc.length * 2)
        doc._realloc(doc.max_length * 2)
    # Move tokens after the split to create space for the new tokens
    doc.length = len(doc) + nb_subtokens -1
    to_process_tensor = (doc.tensor is not None and doc.tensor.size != 0)

spacy/util.py (113 lines changed)

@@ -13,6 +13,7 @@ import functools
import itertools
import numpy.random
import srsly
import catalogue
import sys

try:

@@ -27,29 +28,20 @@ except ImportError:

from .symbols import ORTH
from .compat import cupy, CudaStream, path2str, basestring_, unicode_
from .compat import import_file, importlib_metadata
from .compat import import_file
from .errors import Errors, Warnings, deprecation_warning


LANGUAGES = {}
ARCHITECTURES = {}
_data_path = Path(__file__).parent / "data"
_PRINT_ENV = False


# NB: Only ever call this once! If called more than once within the
# function, test_issue1506 hangs and it's not 100% clear why.
AVAILABLE_ENTRY_POINTS = importlib_metadata.entry_points()


class ENTRY_POINTS(object):
    """Available entry points to register extensions."""

    factories = "spacy_factories"
    languages = "spacy_languages"
    displacy_colors = "spacy_displacy_colors"
    lookups = "spacy_lookups"
    architectures = "spacy_architectures"


class registry(object):
    languages = catalogue.create("spacy", "languages", entry_points=True)
    architectures = catalogue.create("spacy", "architectures", entry_points=True)
    lookups = catalogue.create("spacy", "lookups", entry_points=True)
    factories = catalogue.create("spacy", "factories", entry_points=True)
    displacy_colors = catalogue.create("spacy", "displacy_colors", entry_points=True)


def set_env_log(value):

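The registry class above replaces the ad-hoc LANGUAGES/ARCHITECTURES dicts and the ENTRY_POINTS helpers with catalogue registries, which transparently merge in anything exposed under the matching entry-point groups (spacy_languages, spacy_architectures, spacy_lookups, spacy_factories, spacy_displacy_colors). A minimal sketch of the shared API, using a hypothetical architecture name:

    from spacy.util import registry

    @registry.architectures.register("my_package.MyEncoder.v1")
    def my_encoder(config):
        ...

    "my_package.MyEncoder.v1" in registry.architectures    # membership check
    registry.architectures.get("my_package.MyEncoder.v1")  # -> my_encoder
    registry.architectures.get_all()                       # dict of name -> function
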
@@ -65,8 +57,7 @@ def lang_class_is_loaded(lang):
    lang (unicode): Two-letter language code, e.g. 'en'.
    RETURNS (bool): Whether a Language class has been loaded.
    """
    global LANGUAGES
    return lang in LANGUAGES
    return lang in registry.languages


def get_lang_class(lang):

@@ -75,19 +66,16 @@ def get_lang_class(lang):
    lang (unicode): Two-letter language code, e.g. 'en'.
    RETURNS (Language): Language class.
    """
    global LANGUAGES
    # Check if an entry point is exposed for the language code
    entry_point = get_entry_point(ENTRY_POINTS.languages, lang)
    if entry_point is not None:
        LANGUAGES[lang] = entry_point
        return entry_point
    if lang not in LANGUAGES:
    # Check if language is registered / entry point is available
    if lang in registry.languages:
        return registry.languages.get(lang)
    else:
        try:
            module = importlib.import_module(".lang.%s" % lang, "spacy")
        except ImportError as err:
            raise ImportError(Errors.E048.format(lang=lang, err=err))
        LANGUAGES[lang] = getattr(module, module.__all__[0])
        return LANGUAGES[lang]
        set_lang_class(lang, getattr(module, module.__all__[0]))
    return registry.languages.get(lang)


def set_lang_class(name, cls):

@@ -96,8 +84,7 @@ def set_lang_class(name, cls):
    name (unicode): Name of Language class.
    cls (Language): Language class.
    """
    global LANGUAGES
    LANGUAGES[name] = cls
    registry.languages.register(name, func=cls)


def get_data_path(require_exists=True):

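Taken together, set_lang_class/get_lang_class now go through registry.languages (entry-point group spacy_languages) instead of the module-level LANGUAGES dict. A hedged sketch of registering a custom language class in code; the class and the code "xx_custom" are hypothetical:

    from spacy.language import Language
    from spacy.util import get_lang_class, registry

    class CustomLanguage(Language):
        lang = "xx_custom"

    # same call set_lang_class() makes internally
    registry.languages.register("xx_custom", func=CustomLanguage)
    assert get_lang_class("xx_custom") is CustomLanguage
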
@@ -121,49 +108,11 @@ def set_data_path(path):
    _data_path = ensure_path(path)


def register_architecture(name, arch=None):
    """Decorator to register an architecture. An architecture is a function
    that returns a Thinc Model object.

    name (unicode): The name of the architecture to register.
    arch (Model): Optional architecture if function is called directly and
        not used as a decorator.
    RETURNS (callable): Function to register architecture.
    """
    global ARCHITECTURES
    if arch is not None:
        ARCHITECTURES[name] = arch
        return arch

    def do_registration(arch):
        ARCHITECTURES[name] = arch
        return arch

    return do_registration


def make_layer(arch_config):
    arch_func = get_architecture(arch_config["arch"])
    arch_func = registry.architectures.get(arch_config["arch"])
    return arch_func(arch_config["config"])


def get_architecture(name):
    """Get a model architecture function by name. Raises a KeyError if the
    architecture is not found.

    name (unicode): The name of the architecture.
    RETURNS (Model): The architecture.
    """
    # Check if an entry point is exposed for the architecture code
    entry_point = get_entry_point(ENTRY_POINTS.architectures, name)
    if entry_point is not None:
        ARCHITECTURES[name] = entry_point
    if name not in ARCHITECTURES:
        names = ", ".join(sorted(ARCHITECTURES.keys()))
        raise KeyError(Errors.E174.format(name=name, names=names))
    return ARCHITECTURES[name]


def ensure_path(path):
    """Ensure string is converted to a Path.

@@ -327,34 +276,6 @@ def get_package_path(name):
    return Path(pkg.__file__).parent


def get_entry_points(key):
    """Get registered entry points from other packages for a given key, e.g.
    'spacy_factories' and return them as a dictionary, keyed by name.

    key (unicode): Entry point name.
    RETURNS (dict): Entry points, keyed by name.
    """
    result = {}
    for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
        result[entry_point.name] = entry_point.load()
    return result


def get_entry_point(key, value, default=None):
    """Check if registered entry point is available for a given name and
    load it. Otherwise, return None.

    key (unicode): Entry point name.
    value (unicode): Name of entry point to load.
    default: Optional default value to return.
    RETURNS: The loaded entry point or None.
    """
    for entry_point in AVAILABLE_ENTRY_POINTS.get(key, []):
        if entry_point.name == value:
            return entry_point.load()
    return default


def is_in_jupyter():
    """Check if user is running spaCy from a Jupyter notebook by detecting the
    IPython kernel. Mainly used for the displaCy visualizer.

@@ -109,8 +109,8 @@ raise an error if the pre-defined attrs of the two `DocBin`s don't match.
> doc_bin1.add(nlp("Hello world"))
> doc_bin2 = DocBin(attrs=["LEMMA", "POS"])
> doc_bin2.add(nlp("This is a sentence"))
> merged_bins = doc_bin1.merge(doc_bin2)
> assert len(merged_bins) == 2
> doc_bin1.merge(doc_bin2)
> assert len(doc_bin1) == 2
> ```

| Argument | Type | Description |

@@ -1,9 +1,10 @@
A named entity is a "real-world object" that's assigned a name – for example, a
person, a country, a product or a book title. spaCy can **recognize**
[various types](/api/annotation#named-entities) of named entities in a document,
by asking the model for a **prediction**. Because models are statistical and
strongly depend on the examples they were trained on, this doesn't always work
_perfectly_ and might need some tuning later, depending on your use case.
person, a country, a product or a book title. spaCy can **recognize
[various types](/api/annotation#named-entities)** of named entities in a
document, by asking the model for a **prediction**. Because models are
statistical and strongly depend on the examples they were trained on, this
doesn't always work _perfectly_ and might need some tuning later, depending on
your use case.

Named entities are available as the `ents` property of a `Doc`:

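The code example that follows that sentence in the docs is not included in this hunk; a minimal sketch of what accessing doc.ents looks like (the model name en_core_web_sm is an assumption, any English model with an NER component works):

    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
    # each entity is a Span with a text and a label
    for ent in doc.ents:
        print(ent.text, ent.label_)
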