Merge branch 'develop' into nightly.spacy.io

commit 7f440275ab
Ines Montani, 2020-10-15 17:27:49 +02:00
31 changed files with 795 additions and 204 deletions

.github/contributors/Nuccy90.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Elena Fano |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-09-21 |
| GitHub username | Nuccy90 |
| Website (optional) | |

.github/contributors/rahul1990gupta.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Rahul Gupta |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28 July 2020 |
| GitHub username | rahul1990gupta |
| Website (optional) | |

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a41"
+__version__ = "3.0.0rc1"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

@@ -10,23 +10,26 @@ _stem_suffixes = [
     ["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
     ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
 ]
-# fmt: on

-# reference 1:https://en.wikipedia.org/wiki/Indian_numbering_system
+# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
 # reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
+# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/

-_num_words = [
+_one_to_ten = [
     "शून्य",
     "एक",
     "दो",
     "तीन",
     "चार",
-    "पांच",
+    "पांच", "पाँच",
     "छह",
     "सात",
     "आठ",
     "नौ",
     "दस",
+]
+
+_eleven_to_beyond = [
     "ग्यारह",
     "बारह",
     "तेरह",
@@ -37,13 +40,85 @@ _num_words = [
     "अठारह",
     "उन्नीस",
     "बीस",
+    "इकीस", "इक्कीस",
+    "बाईस",
+    "तेइस",
+    "चौबीस",
+    "पच्चीस",
+    "छब्बीस",
+    "सताइस", "सत्ताइस",
+    "अट्ठाइस",
+    "उनतीस",
     "तीस",
+    "इकतीस", "इकत्तीस",
+    "बतीस", "बत्तीस",
+    "तैंतीस",
+    "चौंतीस",
+    "पैंतीस",
+    "छतीस", "छत्तीस",
+    "सैंतीस",
+    "अड़तीस",
+    "उनतालीस", "उनत्तीस",
     "चालीस",
+    "इकतालीस",
+    "बयालीस",
+    "तैतालीस",
+    "चवालीस",
+    "पैंतालीस",
+    "छयालिस",
+    "सैंतालीस",
+    "अड़तालीस",
+    "उनचास",
     "पचास",
+    "इक्यावन",
+    "बावन",
+    "तिरपन", "तिरेपन",
+    "चौवन", "चउवन",
+    "पचपन",
+    "छप्पन",
+    "सतावन", "सत्तावन",
+    "अठावन",
+    "उनसठ",
     "साठ",
+    "इकसठ",
+    "बासठ",
+    "तिरसठ", "तिरेसठ",
+    "चौंसठ",
+    "पैंसठ",
+    "छियासठ",
+    "सड़सठ",
+    "अड़सठ",
+    "उनहत्तर",
     "सत्तर",
+    "इकहत्तर",
+    "बहत्तर",
+    "तिहत्तर",
+    "चौहत्तर",
+    "पचहत्तर",
+    "छिहत्तर",
+    "सतहत्तर",
+    "अठहत्तर",
+    "उन्नासी", "उन्यासी",
     "अस्सी",
+    "इक्यासी",
+    "बयासी",
+    "तिरासी",
+    "चौरासी",
+    "पचासी",
+    "छियासी",
+    "सतासी",
+    "अट्ठासी",
+    "नवासी",
     "नब्बे",
+    "इक्यानवे",
+    "बानवे",
+    "तिरानवे",
+    "चौरानवे",
+    "पचानवे",
+    "छियानवे",
+    "सतानवे",
+    "अट्ठानवे",
+    "निन्यानवे",
     "सौ",
     "हज़ार",
     "लाख",
@@ -52,6 +127,23 @@ _num_words = [
     "खरब",
 ]

+_num_words = _one_to_ten + _eleven_to_beyond
+
+_ordinal_words_one_to_ten = [
+    "प्रथम", "पहला",
+    "द्वितीय", "दूसरा",
+    "तृतीय", "तीसरा",
+    "चौथा",
+    "पांचवाँ",
+    "छठा",
+    "सातवाँ",
+    "आठवाँ",
+    "नौवाँ",
+    "दसवाँ",
+]
+_ordinal_suffix = "वाँ"
+# fmt: on


 def norm(string):
     # normalise base exceptions, e.g. punctuation or currency symbols
@@ -64,7 +156,7 @@ def norm(string):
     for suffix_group in reversed(_stem_suffixes):
         length = len(suffix_group[0])
         if len(string) <= length:
-            break
+            continue
         for suffix in suffix_group:
             if string.endswith(suffix):
                 return string[:-length]
@@ -74,7 +166,7 @@ def norm(string):

 def like_num(text):
     if text.startswith(("+", "-", "±", "~")):
         text = text[1:]
-    text = text.replace(", ", "").replace(".", "")
+    text = text.replace(",", "").replace(".", "")
     if text.isdigit():
         return True
     if text.count("/") == 1:
@@ -83,6 +175,14 @@ def like_num(text):
             return True
     if text.lower() in _num_words:
         return True
+    # check ordinal numbers
+    # reference: http://www.englishkitab.com/Vocabulary/Numbers.html
+    if text in _ordinal_words_one_to_ten:
+        return True
+    if text.endswith(_ordinal_suffix):
+        if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
+            return True
     return False
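The ordinal handling added to `like_num` above can be exercised in isolation. A minimal sketch, using trimmed stand-ins for the full word lists in the diff (the function name `like_num_ordinal` is ours, for illustration):

```python
# Minimal sketch of the ordinal check added above; the lists here are
# trimmed stand-ins for the full Hindi word lists in the diff.
_eleven_to_beyond = ["ग्यारह", "तिहत्तर", "छत्तीस", "निन्यानवे"]
_ordinal_words_one_to_ten = ["प्रथम", "पहला", "दूसरा", "तीसरा"]
_ordinal_suffix = "वाँ"  # suffix that turns a cardinal into an ordinal


def like_num_ordinal(text: str) -> bool:
    # one to ten have dedicated ordinal forms
    if text in _ordinal_words_one_to_ten:
        return True
    # eleven and beyond: cardinal word plus the ordinal suffix
    if text.endswith(_ordinal_suffix):
        return text[: -len(_ordinal_suffix)] in _eleven_to_beyond
    return False


assert like_num_ordinal("पहला")        # "first"
assert like_num_ordinal("तिहत्तरवाँ")    # "seventy-third"
assert not like_num_ordinal("वाँ")      # bare suffix is not a number
```

The same slicing trick is what the diff relies on: stripping `_ordinal_suffix` must leave a word that is itself in the cardinal list.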

@@ -19,4 +19,6 @@ sentences = [
     "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
     "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
     "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
+    "என்ன வேலை செய்கிறீர்கள்?",
+    "எந்த கல்லூரியில் படிக்கிறாய்?",
 ]

@@ -73,20 +73,16 @@ def like_num(text):
         num, denom = text.split("/")
         if num.isdigit() and denom.isdigit():
             return True
-
     text_lower = text.lower()
-
     # Check cardinal number
     if text_lower in _num_words:
         return True
-
     # Check ordinal number
     if text_lower in _ordinal_words:
         return True
     if text_lower.endswith(_ordinal_endings):
         if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
             return True
-
     return False

@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors

@@ -125,6 +125,11 @@ def he_tokenizer():
     return get_lang_class("he")().tokenizer


+@pytest.fixture(scope="session")
+def hi_tokenizer():
+    return get_lang_class("hi")().tokenizer
+
+
 @pytest.fixture(scope="session")
 def hr_tokenizer():
     return get_lang_class("hr")().tokenizer
@@ -240,11 +245,6 @@ def tr_tokenizer():
     return get_lang_class("tr")().tokenizer


-@pytest.fixture(scope="session")
-def tr_vocab():
-    return get_lang_class("tr").Defaults.create_vocab()
-
-
 @pytest.fixture(scope="session")
 def tt_tokenizer():
     return get_lang_class("tt")().tokenizer
@@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
                 "segmenter": "pkuseg",
             }
         },
-        "initialize": {
-            "tokenizer": {
-                "pkuseg_model": "web",
-            }
-        },
+        "initialize": {"tokenizer": {"pkuseg_model": "web"}},
     }
     nlp = get_lang_class("zh").from_config(config)
     nlp.initialize()


@@ -0,0 +1,43 @@
import pytest

from spacy.lang.hi.lex_attrs import norm, like_num


def test_hi_tokenizer_handles_long_text(hi_tokenizer):
    text = """
कह 1900 दशक शल (ि जयकर) पत चलत ि उसक
, वद (हर ) पस घर रह वद 10 पहल
पढ करन ि गय उसक टन शल अपन पड
रहन ि (िरण ) बत इस खबर
"""
    tokens = hi_tokenizer(text)
    assert len(tokens) == 86


@pytest.mark.parametrize(
    "word,word_norm",
    [
        ("चलता", "चल"),
        ("पढ़ाई", "पढ़"),
        ("देती", "दे"),
        ("जाती", ""),
        ("मुस्कुराकर", "मुस्कुर"),
    ],
)
def test_hi_norm(word, word_norm):
    assert norm(word) == word_norm


@pytest.mark.parametrize(
    "word",
    ["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
)
def test_hi_like_num(word):
    assert like_num(word)


@pytest.mark.parametrize(
    "word",
    ["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"],
)
def test_hi_like_num_ordinal_words(word):
    assert like_num(word)

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_IOB
+
 from spacy import util
 from spacy.lang.en import English
 from spacy.language import Language
@@ -332,6 +335,19 @@ def test_overfitting_IO():
     assert ents2[0].text == "London"
     assert ents2[0].label_ == "LOC"

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_ner_warns_no_lookups(caplog):
     nlp = English()

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import DEP
+
 from spacy.lang.en import English
 from spacy.training import Example
 from spacy.tokens import Doc
@@ -210,3 +213,16 @@ def test_overfitting_IO():
     assert doc2[0].dep_ == "nsubj"
     assert doc2[2].dep_ == "dobj"
     assert doc2[3].dep_ == "punct"
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,5 +1,7 @@
 from typing import Callable, Iterable
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_KB_ID

 from spacy.kb import KnowledgeBase, get_candidates, Candidate
 from spacy.vocab import Vocab
@@ -496,6 +498,19 @@ def test_overfitting_IO():
             predictions.append(ent.kb_id_)
     assert predictions == GOLD_entities

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Russ Cochran captured his first major title with his son as caddie.",
+        "Russ Cochran his reprints include EC Comics.",
+        "Russ Cochran has been publishing comic art.",
+        "Russ Cochran was a member of University of Kentucky's golf team.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_kb_serialization():
     # Test that the KB can be used in a pipeline with a different vocab

@@ -0,0 +1,107 @@
from typing import List

import numpy
import pytest
from numpy.testing import assert_almost_equal
from spacy.vocab import Vocab
from thinc.api import NumpyOps, Model, data_validation
from thinc.types import Array2d, Ragged

from spacy.lang.en import English
from spacy.ml import FeatureExtractor, StaticVectors
from spacy.ml._character_embed import CharacterEmbed
from spacy.tokens import Doc

OPS = NumpyOps()

texts = ["These are 4 words", "Here just three"]
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
l1 = [[9, 8], [7, 6], [5, 4]]
list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")]
list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")]
array = OPS.xp.asarray(l1, dtype="f")
ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i"))


def get_docs():
    vocab = Vocab()
    for t in texts:
        for word in t.split():
            hash_id = vocab.strings.add(word)
            vector = numpy.random.uniform(-1, 1, (7,))
            vocab.set_vector(hash_id, vector)
    docs = [English(vocab)(t) for t in texts]
    return docs


# Test components with a model of type Model[List[Doc], List[Floats2d]]
@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"])
def test_components_batching_list(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats)


# Test components with a model of type Model[List[Doc], Floats2d]
@pytest.mark.parametrize("name", ["textcat"])
def test_components_batching_array(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_array(proc.model, get_docs(), array)


LAYERS = [
    (CharacterEmbed(nM=5, nC=3), get_docs(), list_floats),
    (FeatureExtractor([100, 200]), get_docs(), list_ints),
    (StaticVectors(), get_docs(), ragged),
]


@pytest.mark.parametrize("model,in_data,out_data", LAYERS)
def test_layers_batching_all(model, in_data, out_data):
    # In = List[Doc]
    if isinstance(in_data, list) and isinstance(in_data[0], Doc):
        if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2:
            util_batch_unbatch_docs_array(model, in_data, out_data)
        elif (
            isinstance(out_data, list)
            and isinstance(out_data[0], OPS.xp.ndarray)
            and out_data[0].ndim == 2
        ):
            util_batch_unbatch_docs_list(model, in_data, out_data)
        elif isinstance(out_data, Ragged):
            util_batch_unbatch_docs_ragged(model, in_data, out_data)


def util_batch_unbatch_docs_list(
    model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d]
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        for i in range(len(Y_batched)):
            assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)


def util_batch_unbatch_docs_array(
    model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data).tolist()
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        assert_almost_equal(Y_batched, Y_not_batched, decimal=4)


def util_batch_unbatch_docs_ragged(
    model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = []
        for u in in_data:
            Y_not_batched.extend(model.predict([u]).data.tolist())
        assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
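The batch-vs-unbatch utilities above all check one invariant: predicting a whole batch and predicting the items one at a time must give (near-)identical results. A toy stand-in for `model.predict` makes the pattern easy to see in isolation; the linear map here is hypothetical, not a spaCy model:

```python
import numpy
from numpy.testing import assert_almost_equal

# Toy stand-in for model.predict: a fixed linear map applied row-wise,
# so each input row is processed independently of its batch mates.
W = numpy.asarray([[1.0, -2.0], [0.5, 0.25], [3.0, 1.0]], dtype="f")


def predict(X):
    return X @ W


X = numpy.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 0]], dtype="f")
batched = predict(X)  # one call on the full batch
unbatched = numpy.vstack(  # one call per row
    [predict(X[i : i + 1]) for i in range(len(X))]
)
assert_almost_equal(batched, unbatched, decimal=4)
```

For a purely row-wise map the two are exactly equal; the `decimal=4` tolerance used in the tests above allows for the small floating-point drift that real batched kernels can introduce.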

@@ -1,4 +1,5 @@
 import pytest
+from numpy.testing import assert_equal

 from spacy import util
 from spacy.training import Example
@@ -6,6 +7,7 @@ from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
 from spacy.morphology import Morphology
+from spacy.attrs import MORPH


 def test_label_types():
@@ -101,3 +103,16 @@ def test_overfitting_IO():
     doc2 = nlp2(test_text)
     assert [str(t.morph) for t in doc2] == gold_morphs
     assert [t.pos_ for t in doc2] == gold_pos_tags
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,4 +1,6 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import SENT_START

 from spacy import util
 from spacy.training import Example
@@ -80,3 +82,18 @@ def test_overfitting_IO():
     nlp2 = util.load_model_from_path(tmp_dir)
     doc2 = nlp2(test_text)
     assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [
+        doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts]
+    ]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import TAG
+
 from spacy import util
 from spacy.training import Example
 from spacy.lang.en import English
@@ -117,6 +120,19 @@ def test_overfitting_IO():
     assert doc2[2].tag_ is "J"
     assert doc2[3].tag_ is "N"

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "I like green eggs.",
+        "Here is another one.",
+        "I eat ham.",
+    ]
+    batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_tagger_requires_labels():
     nlp = English()

@@ -1,6 +1,7 @@
 import pytest
 import random
 import numpy.random
+from numpy.testing import assert_equal
 from thinc.api import fix_random_seed
 from spacy import util
 from spacy.lang.en import English
@@ -174,6 +175,14 @@ def test_overfitting_IO():
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
+    batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 # fmt: off
 @pytest.mark.parametrize(

@@ -0,0 +1,76 @@
from thinc.api import fix_random_seed

from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
from spacy.pipeline import merge_entities


def test_issue5551():
    """Test that after fixing the random seed, the results of the pipeline are truly identical"""
    component = "textcat"
    pipe_cfg = {
        "model": {
            "@architectures": "spacy.TextCatBOW.v1",
            "exclusive_classes": True,
            "ngram_size": 2,
            "no_output_layer": False,
        }
    }

    results = []
    for i in range(3):
        fix_random_seed(0)
        nlp = English()
        example = (
            "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
            {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
        )
        pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
        for label in set(example[1]["cats"]):
            pipe.add_label(label)
        nlp.initialize()
        # Store the result of each iteration
        result = pipe.model.predict([nlp.make_doc(example[0])])
        results.append(list(result[0]))
    # All results should be the same because of the fixed seed
    assert len(results) == 3
    assert results[0] == results[1]
    assert results[0] == results[2]


def test_issue5838():
    # Displacy's EntityRenderer break line
    # not working after last entity
    sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n"
    nlp = English()
    doc = nlp(sample_text)
    doc.ents = [Span(doc, 7, 8, label="test")]
    html = displacy.render(doc, style="ent")
    found = html.count("</br>")
    assert found == 4


def test_issue5918():
    # Test edge case when merging entities.
    nlp = English()
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {"label": "ORG", "pattern": "Digicon Inc"},
        {"label": "ORG", "pattern": "Rotan Mosle Inc's"},
        {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
    ]
    ruler.add_patterns(patterns)
    text = """
    Digicon Inc said it has completed the previously-announced disposition
    of its computer systems division to an investment group led by
    Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
    """
    doc = nlp(text)
    assert len(doc.ents) == 3
    # make it so that the third span's head is within the entity (ent_iob=I)
    # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
    # TODO: test for logging here
    # with pytest.warns(UserWarning):
    #     doc[29].head = doc[33]
    doc = merge_entities(doc)
    assert len(doc.ents) == 3


@@ -1,37 +0,0 @@
from spacy.lang.en import English
from spacy.util import fix_random_seed
def test_issue5551():
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
component = "textcat"
pipe_cfg = {
"model": {
"@architectures": "spacy.TextCatBOW.v1",
"exclusive_classes": True,
"ngram_size": 2,
"no_output_layer": False,
}
}
results = []
for i in range(3):
fix_random_seed(0)
nlp = English()
example = (
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
)
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
for label in set(example[1]["cats"]):
pipe.add_label(label)
nlp.initialize()
# Store the result of each iteration
result = pipe.model.predict([nlp.make_doc(example[0])])
results.append(list(result[0]))
# All results should be the same because of the fixed seed
assert len(results) == 3
assert results[0] == results[1]
assert results[0] == results[2]


@@ -1,23 +0,0 @@
from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
SAMPLE_TEXT = """First line
Second line, with ent
Third line
Fourth line
"""
def test_issue5838():
# Displacy's EntityRenderer break line
# not working after last entity
nlp = English()
doc = nlp(SAMPLE_TEXT)
doc.ents = [Span(doc, 7, 8, label="test")]
html = displacy.render(doc, style="ent")
found = html.count("</br>")
assert found == 4


@@ -1,29 +0,0 @@
from spacy.lang.en import English
from spacy.pipeline import merge_entities
def test_issue5918():
# Test edge case when merging entities.
nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [
{"label": "ORG", "pattern": "Digicon Inc"},
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
]
ruler.add_patterns(patterns)
text = """
Digicon Inc said it has completed the previously-announced disposition
of its computer systems division to an investment group led by
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
"""
doc = nlp(text)
assert len(doc.ents) == 3
# make it so that the third span's head is within the entity (ent_iob=I)
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
# TODO: test for logging here
# with pytest.warns(UserWarning):
# doc[29].head = doc[33]
doc = merge_entities(doc)
assert len(doc.ents) == 3


@@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
         docs = [docs]
     json_doc = {"id": doc_id, "paragraphs": []}
     for i, doc in enumerate(docs):
-        json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
+        raw = None if doc.has_unknown_spaces else doc.text
+        json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []}
        for cat, val in doc.cats.items():
            json_cat = {"label": cat, "value": val}
            json_para["cats"].append(json_cat)
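A minimal sketch of the behavior this hunk introduces, assuming spaCy v3's `spacy.training.docs_to_json`; the two example `Doc`s here are illustrative, not from the commit:

```python
import spacy
from spacy.tokens import Doc
from spacy.training import docs_to_json

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # docs_to_json iterates doc.sents

# Doc created from text: spacing is known, so "raw" keeps the original text
doc = nlp("Hello world")
print(docs_to_json([doc])["paragraphs"][0]["raw"])  # -> Hello world

# Doc built from words without spaces: has_unknown_spaces is True, so "raw"
# becomes None instead of a reconstructed (possibly wrong) text
doc2 = Doc(nlp.vocab, words=["Hello", "world"], sent_starts=[True, False])
print(docs_to_json([doc2])["paragraphs"][0]["raw"])  # -> None
```

The point of the change: a `Doc` constructed without whitespace information can't round-trip its text, so the JSON now records `raw: None` rather than a misleading string.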


@@ -112,10 +112,10 @@ def train(
            nlp.to_disk(final_model_path)
    else:
        nlp.to_disk(final_model_path)
    # This will only run if we don't hit an error
    stdout.write(
        msg.good("Saved pipeline to output directory", final_model_path) + "\n"
    )


def train_while_improving(


@@ -1,19 +1,18 @@
 import { Help } from 'components/typography'; import Link from 'components/link'

+<!-- TODO: update speed and v2 NER numbers -->
+
 <figure>

-| Pipeline                                                   | Parser | Tagger | NER  | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br/>GPU <Help>words per second on GPU, higher is better</Help> |
-| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: |
-| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5   | 98.3   | 89.7 | 1k                                                                   | 8k                                                                  |
-| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   | 92.2   | 97.4   | 85.8 | 7k                                                                   |                                                                     |
-| `en_core_web_lg` (spaCy v2)                                | 91.9   | 97.2   |      | 10k                                                                  |                                                                     |
+| Pipeline                                                   | Parser | Tagger | NER  |
+| ---------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5   | 98.3   | 89.4 |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   | 92.2   | 97.4   | 85.4 |
+| `en_core_web_lg` (spaCy v2)                                | 91.9   | 97.2   | 85.5 |

 <figcaption class="caption">

 **Full pipeline accuracy and speed** on the
-[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
+[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on
+the development set).

 </figcaption>

@@ -21,14 +20,11 @@ import { Help } from 'components/typography'; import Link from 'components/link'

 <figure>

-| Named Entity Recognition System                                                | OntoNotes | CoNLL '03 |
-| ------------------------------------------------------------------------------ | --------: | --------: |
-| spaCy RoBERTa (2020)                                                           | 89.7      | 91.6      |
-| spaCy CNN (2020)                                                               | 84.5      |           |
-| spaCy CNN (2017)                                                               |           |           |
-| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup>      | 88.8      | 92.1      |
-| <Link to="https://github.com/flairNLP/flair" hideIcon>Flair</Link><sup>2</sup> | 89.7      | 93.1      |
-| BERT Base<sup>3</sup>                                                          | -         | 92.4      |
+| Named Entity Recognition System  | OntoNotes | CoNLL '03 |
+| -------------------------------- | --------: | --------: |
+| spaCy RoBERTa (2020)             | 89.7      | 91.6      |
+| Stanza (StanfordNLP)<sup>1</sup> | 88.8      | 92.1      |
+| Flair<sup>2</sup>                | 89.7      | 93.1      |

 <figcaption class="caption">

@@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'

 [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and
 [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See
 [NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for
-more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf).
-**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3.
-** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805).
+more results. Project template:
+[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. **
+[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. **
+[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).

 </figcaption>


@@ -10,6 +10,18 @@ menu:

 ## Comparison {#comparison hidden="true"}

+spaCy is a **free, open-source library** for advanced **Natural Language
+Processing** (NLP) in Python. It's designed specifically for **production use**
+and helps you build applications that process and "understand" large volumes of
+text. It can be used to build information extraction or natural language
+understanding systems.
+
+### Feature overview {#comparison-features}
+
+import Features from 'widgets/features.js'
+
+<Features />
+
 ### When should I use spaCy? {#comparison-usage}

 - ✅ **I'm a beginner and just getting started with NLP.** spaCy makes it easy

@@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 | Dependency Parsing System                                                      | UAS  | LAS  |
 | ------------------------------------------------------------------------------ | ---: | ---: |
-| spaCy RoBERTa (2020)<sup>1</sup>                                               | 95.5 | 94.3 |
-| spaCy CNN (2020)<sup>1</sup>                                                   |      |      |
+| spaCy RoBERTa (2020)                                                           | 95.5 | 94.3 |
 | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
 | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |

@@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 **Dependency parsing accuracy** on the Penn Treebank. See
 [NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
-results. **1. ** Project template:
+results. Project template:
 [`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).

 </figcaption>


@@ -489,11 +489,11 @@ This allows you to write callbacks that consider the entire set of matched
 phrases, so that you can resolve overlaps and other conflicts in whatever way
 you prefer.

 | Argument  | Description                                                                                                                                       |
 | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `matcher` | The matcher instance. ~~Matcher~~                                                                                                                   |
 | `doc`     | The document the matcher was used on. ~~Doc~~                                                                                                       |
 | `i`       | Index of the current match (`matches[i]`). ~~int~~                                                                                                  |
 | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. ~~List[Tuple[int, int, int]]~~  |

 ### Creating spans from matches {#matcher-spans}
@@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start:end].sent`,
-and calculate the start and end of the matched span within the sentence. Using
+lets you determine the sentence containing the match, `doc[start:end].sent`, and
+calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.


@@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 <Benchmarks />

+#### New trained transformer-based pipelines {#features-transformers-pipelines}
+
+> #### Notes on model capabilities
+>
+> The models are each trained with a **single transformer** shared across the
+> pipeline, which requires it to be trained on a single corpus. For
+> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
+> corpus, which has annotations across several tasks. For [French](/models/fr),
+> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
+> corpus that had both syntactic and entity annotations, so the transformer
+> models for those languages do not include NER.
+
+| Package                                          | Language | Transformer                                                                                   | Tagger | Parser |  NER |
+| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf)  | English  | [`roberta-base`](https://huggingface.co/roberta-base)                                         |   97.8 |   95.0 | 89.4 |
+| [`de_dep_news_trf`](/models/de#de_dep_news_trf)  | German   | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased)                     |   99.0 |   95.8 |    - |
+| [`es_dep_news_trf`](/models/es#es_dep_news_trf)  | Spanish  | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) |   98.2 |   94.6 |    - |
+| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf)  | French   | [`camembert-base`](https://huggingface.co/camembert-base)                                     |   95.7 |   94.9 |    - |
+| [`zh_core_web_trf`](/models/zh#zh_core_web_trf)  | Chinese  | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese)                               |   92.5 |   77.2 | 75.6 |
+
 <Infobox title="Details & Documentation" emoji="📖" list>

 - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),

@@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [TransformerListener](/api/architectures#TransformerListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf),
-  [`de_dep_news_trf`](/models/de#de_dep_news_trf),
-  [`es_dep_news_trf`](/models/es#es_dep_news_trf),
-  [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf),
-  [`zh_core_web_trf`](/models/zh#zh_core_web_trf)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)


@ -0,0 +1,72 @@
import React from 'react'
import { graphql, StaticQuery } from 'gatsby'
import { Ul, Li } from '../components/list'
export default () => (
<StaticQuery
query={query}
render={({ site }) => {
const { counts } = site.siteMetadata
return (
<Ul>
<Li>
Support for <strong>{counts.langs}+ languages</strong>
</Li>
<Li>
<strong>{counts.models} trained pipelines</strong> for{' '}
{counts.modelLangs} languages
</Li>
<Li>
Multi-task learning with pretrained <strong>transformers</strong> like
BERT
</Li>
<Li>
Pretrained <strong>word vectors</strong>
</Li>
<Li> State-of-the-art speed</Li>
<Li>
Production-ready <strong>training system</strong>
</Li>
<Li>
Linguistically-motivated <strong>tokenization</strong>
</Li>
<Li>
Components for <strong>named entity</strong> recognition, part-of-speech
tagging, dependency parsing, sentence segmentation,{' '}
<strong>text classification</strong>, lemmatization, morphological analysis,
entity linking and more
</Li>
<Li>
Easily extensible with <strong>custom components</strong> and attributes
</Li>
<Li>
Support for custom models in <strong>PyTorch</strong>,{' '}
<strong>TensorFlow</strong> and other frameworks
</Li>
<Li>
Built in <strong>visualizers</strong> for syntax and NER
</Li>
<Li>
Easy <strong>model packaging</strong>, deployment and workflow management
</Li>
<Li> Robust, rigorously evaluated accuracy</Li>
</Ul>
)
}}
/>
)
const query = graphql`
query FeaturesQuery {
site {
siteMetadata {
counts {
langs
modelLangs
models
}
}
}
}
`


@@ -14,13 +14,13 @@ import {
     LandingBanner,
 } from '../components/landing'
 import { H2 } from '../components/typography'
-import { Ul, Li } from '../components/list'
 import { InlineCode } from '../components/code'
 import Button from '../components/button'
 import Link from '../components/link'

 import QuickstartTraining from './quickstart-training'
 import Project from './project'
+import Features from './features'
 import courseImage from '../../docs/images/course.jpg'
 import prodigyImage from '../../docs/images/prodigy_overview.jpg'
 import projectsImage from '../../docs/images/projects.png'

@@ -56,7 +56,7 @@ for entity in doc.ents:
 }

 const Landing = ({ data }) => {
-    const { counts, nightly } = data
+    const { nightly } = data
     const codeExample = getCodeExample(nightly)
     return (
         <>

@@ -98,51 +98,7 @@ const Landing = ({ data }) => {
                 <LandingCol>
                     <H2>Features</H2>
-                    <Ul>
-                        <Li>
-                            Support for <strong>{counts.langs}+ languages</strong>
-                        </Li>
-                        <Li>
-                            <strong>{counts.models} trained pipelines</strong> for{' '}
-                            {counts.modelLangs} languages
-                        </Li>
-                        <Li>
-                            Multi-task learning with pretrained <strong>transformers</strong>{' '}
-                            like BERT
-                        </Li>
-                        <Li>
-                            Pretrained <strong>word vectors</strong>
-                        </Li>
-                        <Li>State-of-the-art speed</Li>
-                        <Li>
-                            Production-ready <strong>training system</strong>
-                        </Li>
-                        <Li>
-                            Linguistically-motivated <strong>tokenization</strong>
-                        </Li>
-                        <Li>
-                            Components for <strong>named entity</strong> recognition,
-                            part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
-                            <strong>text classification</strong>, lemmatization, morphological
-                            analysis, entity linking and more
-                        </Li>
-                        <Li>
-                            Easily extensible with <strong>custom components</strong> and
-                            attributes
-                        </Li>
-                        <Li>
-                            Support for custom models in <strong>PyTorch</strong>,{' '}
-                            <strong>TensorFlow</strong> and other frameworks
-                        </Li>
-                        <Li>
-                            Built in <strong>visualizers</strong> for syntax and NER
-                        </Li>
-                        <Li>
-                            Easy <strong>model packaging</strong>, deployment and workflow
-                            management
-                        </Li>
-                        <Li>Robust, rigorously evaluated accuracy</Li>
-                    </Ul>
+                    <Features />
                 </LandingCol>
             </LandingGrid>

@@ -333,11 +289,6 @@ const landingQuery = graphql`
             siteMetadata {
                 nightly
                 repo
-                counts {
-                    langs
-                    modelLangs
-                    models
-                }
             }
         }
     }