Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-11 09:00:36 +03:00)

Merge branch 'develop' into nightly.spacy.io

Commit: 7f440275ab

.github/contributors/Nuccy90.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Elena Fano |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-09-21 |
|
||||
| GitHub username | Nuccy90 |
|
||||
| Website (optional) | |
|
.github/contributors/rahul1990gupta.md (vendored, new file, 106 lines)
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Rahul Gupta |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 28 July 2020 |
|
||||
| GitHub username | rahul1990gupta |
|
||||
| Website (optional) | |
|
@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "3.0.0a41"
__version__ = "3.0.0rc1"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"
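The hunk above bumps the nightly package from the a41 alpha to the first release candidate. A quick way to confirm which prerelease is installed (a minimal sketch, assuming the `spacy-nightly` build is what's on the path):

```python
import spacy

# spacy.about holds the metadata defined in the hunk above
print(spacy.about.__title__)    # "spacy-nightly"
print(spacy.about.__version__)  # "3.0.0rc1" after this change
```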
|
@ -10,23 +10,26 @@ _stem_suffixes = [
|
|||
["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
|
||||
["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
|
||||
]
|
||||
# fmt: on
|
||||
|
||||
# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
|
||||
# reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
|
||||
# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/
|
||||
|
||||
_num_words = [
|
||||
_one_to_ten = [
|
||||
"शून्य",
|
||||
"एक",
|
||||
"दो",
|
||||
"तीन",
|
||||
"चार",
|
||||
"पांच",
|
||||
"पांच", "पाँच",
|
||||
"छह",
|
||||
"सात",
|
||||
"आठ",
|
||||
"नौ",
|
||||
"दस",
|
||||
]
|
||||
|
||||
_eleven_to_beyond = [
|
||||
"ग्यारह",
|
||||
"बारह",
|
||||
"तेरह",
|
||||
|
@ -37,13 +40,85 @@ _num_words = [
|
|||
"अठारह",
|
||||
"उन्नीस",
|
||||
"बीस",
|
||||
"इकीस", "इक्कीस",
|
||||
"बाईस",
|
||||
"तेइस",
|
||||
"चौबीस",
|
||||
"पच्चीस",
|
||||
"छब्बीस",
|
||||
"सताइस", "सत्ताइस",
|
||||
"अट्ठाइस",
|
||||
"उनतीस",
|
||||
"तीस",
|
||||
"इकतीस", "इकत्तीस",
|
||||
"बतीस", "बत्तीस",
|
||||
"तैंतीस",
|
||||
"चौंतीस",
|
||||
"पैंतीस",
|
||||
"छतीस", "छत्तीस",
|
||||
"सैंतीस",
|
||||
"अड़तीस",
|
||||
"उनतालीस", "उनत्तीस",
|
||||
"चालीस",
|
||||
"इकतालीस",
|
||||
"बयालीस",
|
||||
"तैतालीस",
|
||||
"चवालीस",
|
||||
"पैंतालीस",
|
||||
"छयालिस",
|
||||
"सैंतालीस",
|
||||
"अड़तालीस",
|
||||
"उनचास",
|
||||
"पचास",
|
||||
"इक्यावन",
|
||||
"बावन",
|
||||
"तिरपन", "तिरेपन",
|
||||
"चौवन", "चउवन",
|
||||
"पचपन",
|
||||
"छप्पन",
|
||||
"सतावन", "सत्तावन",
|
||||
"अठावन",
|
||||
"उनसठ",
|
||||
"साठ",
|
||||
"इकसठ",
|
||||
"बासठ",
|
||||
"तिरसठ", "तिरेसठ",
|
||||
"चौंसठ",
|
||||
"पैंसठ",
|
||||
"छियासठ",
|
||||
"सड़सठ",
|
||||
"अड़सठ",
|
||||
"उनहत्तर",
|
||||
"सत्तर",
|
||||
"इकहत्तर"
|
||||
"बहत्तर",
|
||||
"तिहत्तर",
|
||||
"चौहत्तर",
|
||||
"पचहत्तर",
|
||||
"छिहत्तर",
|
||||
"सतहत्तर",
|
||||
"अठहत्तर",
|
||||
"उन्नासी", "उन्यासी"
|
||||
"अस्सी",
|
||||
"इक्यासी",
|
||||
"बयासी",
|
||||
"तिरासी",
|
||||
"चौरासी",
|
||||
"पचासी",
|
||||
"छियासी",
|
||||
"सतासी",
|
||||
"अट्ठासी",
|
||||
"नवासी",
|
||||
"नब्बे",
|
||||
"इक्यानवे",
|
||||
"बानवे",
|
||||
"तिरानवे",
|
||||
"चौरानवे",
|
||||
"पचानवे",
|
||||
"छियानवे",
|
||||
"सतानवे",
|
||||
"अट्ठानवे",
|
||||
"निन्यानवे",
|
||||
"सौ",
|
||||
"हज़ार",
|
||||
"लाख",
|
||||
|
@ -52,6 +127,23 @@ _num_words = [
|
|||
"खरब",
|
||||
]
|
||||
|
||||
_num_words = _one_to_ten + _eleven_to_beyond
|
||||
|
||||
_ordinal_words_one_to_ten = [
|
||||
"प्रथम", "पहला",
|
||||
"द्वितीय", "दूसरा",
|
||||
"तृतीय", "तीसरा",
|
||||
"चौथा",
|
||||
"पांचवाँ",
|
||||
"छठा",
|
||||
"सातवाँ",
|
||||
"आठवाँ",
|
||||
"नौवाँ",
|
||||
"दसवाँ",
|
||||
]
|
||||
_ordinal_suffix = "वाँ"
|
||||
# fmt: on
|
||||
|
||||
|
||||
def norm(string):
|
||||
# normalise base exceptions, e.g. punctuation or currency symbols
|
||||
|
@ -64,7 +156,7 @@ def norm(string):
|
|||
for suffix_group in reversed(_stem_suffixes):
|
||||
length = len(suffix_group[0])
|
||||
if len(string) <= length:
|
||||
break
|
||||
continue
|
||||
for suffix in suffix_group:
|
||||
if string.endswith(suffix):
|
||||
return string[:-length]
|
||||
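The `norm` helper shown above strips the longest matching suffix from a Hindi surface form, checking suffix groups from longest to shortest. A short usage sketch, mirroring the new Hindi tests added later in this commit:

```python
from spacy.lang.hi.lex_attrs import norm

# suffix stripping as exercised by spacy/tests/lang/hi/test_lex_attrs.py below
assert norm("चलता") == "चल"
assert norm("पढ़ाई") == "पढ़"
assert norm("देती") == "दे"
```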
|
@ -83,6 +175,14 @@ def like_num(text):
|
|||
return True
|
||||
if text.lower() in _num_words:
|
||||
return True
|
||||
|
||||
# check ordinal numbers
|
||||
# reference: http://www.englishkitab.com/Vocabulary/Numbers.html
|
||||
if text in _ordinal_words_one_to_ten:
|
||||
return True
|
||||
if text.endswith(_ordinal_suffix):
|
||||
if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
|
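With the word lists split into `_one_to_ten` and `_eleven_to_beyond`, `like_num` now also recognizes Hindi ordinals: the explicit forms for one through ten, plus any word from eleven upwards carrying the `वाँ` suffix. A usage sketch based on the parametrized tests added in this commit:

```python
from spacy.lang.hi.lex_attrs import like_num

# digits (including Devanagari numerals), fractions and cardinal words
assert like_num("1987") and like_num("१९८७") and like_num("५/१०")
assert like_num("उन्नीस") and like_num("नवासी")
# ordinals: explicit words for one to ten, or a base word plus the "वाँ" suffix
assert like_num("पहला") and like_num("तिहत्तरवाँ")
```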
@@ -19,4 +19,6 @@ sentences = [
    "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
    "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
    "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
    "என்ன வேலை செய்கிறீர்கள்?",
    "எந்த கல்லூரியில் படிக்கிறாய்?",
]
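The two new example sentences extend the Tamil `sentences` list (presumably `spacy/lang/ta/examples.py`). They can be used to smoke-test a blank Tamil pipeline, for example:

```python
# Sketch, assuming the list above is exposed as spacy.lang.ta.examples.sentences
import spacy
from spacy.lang.ta.examples import sentences

nlp = spacy.blank("ta")
doc = nlp(sentences[0])
print([token.text for token in doc])
```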
@ -73,20 +73,16 @@ def like_num(text):
|
|||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
|
||||
text_lower = text.lower()
|
||||
|
||||
# Check cardinal number
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
if text_lower.endswith(_ordinal_endings):
|
||||
if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
|
@@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
|
|
@ -125,6 +125,11 @@ def he_tokenizer():
|
|||
return get_lang_class("he")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def hi_tokenizer():
|
||||
return get_lang_class("hi")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def hr_tokenizer():
|
||||
return get_lang_class("hr")().tokenizer
|
||||
|
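The new session-scoped `hi_tokenizer` fixture added above follows the same pattern as the other language fixtures: instantiate the language class and take its tokenizer. Outside of pytest, the equivalent is roughly:

```python
# Rough equivalent of the fixture, for interactive use
from spacy.util import get_lang_class

hi_tokenizer = get_lang_class("hi")().tokenizer
doc = hi_tokenizer("ये कहानी 1900 के दशक की है।")
print(len(doc))
```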
@ -240,11 +245,6 @@ def tr_tokenizer():
|
|||
return get_lang_class("tr")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def tr_vocab():
|
||||
return get_lang_class("tr").Defaults.create_vocab()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def tt_tokenizer():
|
||||
return get_lang_class("tt")().tokenizer
|
||||
|
@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
|
|||
"segmenter": "pkuseg",
|
||||
}
|
||||
},
|
||||
"initialize": {
|
||||
"tokenizer": {
|
||||
"pkuseg_model": "web",
|
||||
}
|
||||
},
|
||||
"initialize": {"tokenizer": {"pkuseg_model": "web"}},
|
||||
}
|
||||
nlp = get_lang_class("zh").from_config(config)
|
||||
nlp.initialize()
|
||||
|
|
spacy/tests/lang/hi/__init__.py (new file, 0 lines)

spacy/tests/lang/hi/test_lex_attrs.py (new file, 43 lines)
|
@ -0,0 +1,43 @@
|
|||
import pytest
|
||||
from spacy.lang.hi.lex_attrs import norm, like_num
|
||||
|
||||
|
||||
def test_hi_tokenizer_handles_long_text(hi_tokenizer):
|
||||
text = """
|
||||
ये कहानी 1900 के दशक की है। कौशल्या (स्मिता जयकर) को पता चलता है कि उसका
|
||||
छोटा बेटा, देवदास (शाहरुख खान) वापस घर आ रहा है। देवदास 10 साल पहले कानून की
|
||||
पढ़ाई करने के लिए इंग्लैंड गया था। उसके लौटने की खुशी में ये बात कौशल्या अपनी पड़ोस
|
||||
में रहने वाली सुमित्रा (किरण खेर) को भी बता देती है। इस खबर से वो भी खुश हो जाती है।
|
||||
"""
|
||||
tokens = hi_tokenizer(text)
|
||||
assert len(tokens) == 86
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word,word_norm",
|
||||
[
|
||||
("चलता", "चल"),
|
||||
("पढ़ाई", "पढ़"),
|
||||
("देती", "दे"),
|
||||
("जाती", "ज"),
|
||||
("मुस्कुराकर", "मुस्कुर"),
|
||||
],
|
||||
)
|
||||
def test_hi_norm(word, word_norm):
|
||||
assert norm(word) == word_norm
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
|
||||
)
|
||||
def test_hi_like_num(word):
|
||||
assert like_num(word)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"],
|
||||
)
|
||||
def test_hi_like_num_ordinal_words(word):
|
||||
assert like_num(word)
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import ENT_IOB
|
||||
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
|
@ -332,6 +335,19 @@ def test_overfitting_IO():
|
|||
assert ents2[0].text == "London"
|
||||
assert ents2[0].label_ == "LOC"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_ner_warns_no_lookups(caplog):
|
||||
nlp = English()
|
||||
|
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import DEP
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.training import Example
|
||||
from spacy.tokens import Doc
|
||||
|
@ -210,3 +213,16 @@ def test_overfitting_IO():
|
|||
assert doc2[0].dep_ == "nsubj"
|
||||
assert doc2[2].dep_ == "dobj"
|
||||
assert doc2[3].dep_ == "punct"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
from typing import Callable, Iterable
|
||||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import ENT_KB_ID
|
||||
|
||||
from spacy.kb import KnowledgeBase, get_candidates, Candidate
|
||||
from spacy.vocab import Vocab
|
||||
|
@ -496,6 +498,19 @@ def test_overfitting_IO():
|
|||
predictions.append(ent.kb_id_)
|
||||
assert predictions == GOLD_entities
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Russ Cochran captured his first major title with his son as caddie.",
|
||||
"Russ Cochran his reprints include EC Comics.",
|
||||
"Russ Cochran has been publishing comic art.",
|
||||
"Russ Cochran was a member of University of Kentucky's golf team.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_kb_serialization():
|
||||
# Test that the KB can be used in a pipeline with a different vocab
|
||||
|
|
spacy/tests/pipeline/test_models.py (new file, 107 lines)
|
@ -0,0 +1,107 @@
|
|||
from typing import List
|
||||
|
||||
import numpy
|
||||
import pytest
|
||||
from numpy.testing import assert_almost_equal
|
||||
from spacy.vocab import Vocab
|
||||
from thinc.api import NumpyOps, Model, data_validation
|
||||
from thinc.types import Array2d, Ragged
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.ml import FeatureExtractor, StaticVectors
|
||||
from spacy.ml._character_embed import CharacterEmbed
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
OPS = NumpyOps()
|
||||
|
||||
texts = ["These are 4 words", "Here just three"]
|
||||
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
|
||||
l1 = [[9, 8], [7, 6], [5, 4]]
|
||||
list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")]
|
||||
list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")]
|
||||
array = OPS.xp.asarray(l1, dtype="f")
|
||||
ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i"))
|
||||
|
||||
|
||||
def get_docs():
|
||||
vocab = Vocab()
|
||||
for t in texts:
|
||||
for word in t.split():
|
||||
hash_id = vocab.strings.add(word)
|
||||
vector = numpy.random.uniform(-1, 1, (7,))
|
||||
vocab.set_vector(hash_id, vector)
|
||||
docs = [English(vocab)(t) for t in texts]
|
||||
return docs
|
||||
|
||||
|
||||
# Test components with a model of type Model[List[Doc], List[Floats2d]]
|
||||
@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"])
|
||||
def test_components_batching_list(name):
|
||||
nlp = English()
|
||||
proc = nlp.create_pipe(name)
|
||||
util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats)
|
||||
|
||||
|
||||
# Test components with a model of type Model[List[Doc], Floats2d]
|
||||
@pytest.mark.parametrize("name", ["textcat"])
|
||||
def test_components_batching_array(name):
|
||||
nlp = English()
|
||||
proc = nlp.create_pipe(name)
|
||||
util_batch_unbatch_docs_array(proc.model, get_docs(), array)
|
||||
|
||||
|
||||
LAYERS = [
|
||||
(CharacterEmbed(nM=5, nC=3), get_docs(), list_floats),
|
||||
(FeatureExtractor([100, 200]), get_docs(), list_ints),
|
||||
(StaticVectors(), get_docs(), ragged),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("model,in_data,out_data", LAYERS)
|
||||
def test_layers_batching_all(model, in_data, out_data):
|
||||
# In = List[Doc]
|
||||
if isinstance(in_data, list) and isinstance(in_data[0], Doc):
|
||||
if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2:
|
||||
util_batch_unbatch_docs_array(model, in_data, out_data)
|
||||
elif (
|
||||
isinstance(out_data, list)
|
||||
and isinstance(out_data[0], OPS.xp.ndarray)
|
||||
and out_data[0].ndim == 2
|
||||
):
|
||||
util_batch_unbatch_docs_list(model, in_data, out_data)
|
||||
elif isinstance(out_data, Ragged):
|
||||
util_batch_unbatch_docs_ragged(model, in_data, out_data)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_list(
|
||||
model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d]
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data)
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
for i in range(len(Y_batched)):
|
||||
assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_array(
|
||||
model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data).tolist()
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_ragged(
|
||||
model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data)
|
||||
Y_not_batched = []
|
||||
for u in in_data:
|
||||
Y_not_batched.extend(model.predict([u]).data.tolist())
|
||||
assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
|
|
@ -1,4 +1,5 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
|
@ -6,6 +7,7 @@ from spacy.lang.en import English
|
|||
from spacy.language import Language
|
||||
from spacy.tests.util import make_tempdir
|
||||
from spacy.morphology import Morphology
|
||||
from spacy.attrs import MORPH
|
||||
|
||||
|
||||
def test_label_types():
|
||||
|
@ -101,3 +103,16 @@ def test_overfitting_IO():
|
|||
doc2 = nlp2(test_text)
|
||||
assert [str(t.morph) for t in doc2] == gold_morphs
|
||||
assert [t.pos_ for t in doc2] == gold_pos_tags
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import SENT_START
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
|
@ -80,3 +82,18 @@ def test_overfitting_IO():
|
|||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [
|
||||
doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts]
|
||||
]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import TAG
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
from spacy.lang.en import English
|
||||
|
@ -117,6 +120,19 @@ def test_overfitting_IO():
|
|||
assert doc2[2].tag_ == "J"
assert doc2[3].tag_ == "N"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"I like green eggs.",
|
||||
"Here is another one.",
|
||||
"I eat ham.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_tagger_requires_labels():
|
||||
nlp = English()
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
import random
|
||||
import numpy.random
|
||||
from numpy.testing import assert_equal
|
||||
from thinc.api import fix_random_seed
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
|
@ -174,6 +175,14 @@ def test_overfitting_IO():
|
|||
assert scores["cats_score"] == 1.0
|
||||
assert "cats_score_desc" in scores
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
|
||||
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
|
|
spacy/tests/regression/test_issue5501-6000.py (new file, 76 lines)
|
@ -0,0 +1,76 @@
|
|||
from thinc.api import fix_random_seed
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span
|
||||
from spacy import displacy
|
||||
from spacy.pipeline import merge_entities
|
||||
|
||||
|
||||
def test_issue5551():
|
||||
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
|
||||
component = "textcat"
|
||||
pipe_cfg = {
|
||||
"model": {
|
||||
"@architectures": "spacy.TextCatBOW.v1",
|
||||
"exclusive_classes": True,
|
||||
"ngram_size": 2,
|
||||
"no_output_layer": False,
|
||||
}
|
||||
}
|
||||
results = []
|
||||
for i in range(3):
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
example = (
|
||||
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
|
||||
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
|
||||
)
|
||||
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
|
||||
for label in set(example[1]["cats"]):
|
||||
pipe.add_label(label)
|
||||
nlp.initialize()
|
||||
# Store the result of each iteration
|
||||
result = pipe.model.predict([nlp.make_doc(example[0])])
|
||||
results.append(list(result[0]))
|
||||
# All results should be the same because of the fixed seed
|
||||
assert len(results) == 3
|
||||
assert results[0] == results[1]
|
||||
assert results[0] == results[2]
|
||||
|
||||
|
||||
def test_issue5838():
|
||||
# Displacy's EntityRenderer break line
|
||||
# not working after last entity
|
||||
sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n"
|
||||
nlp = English()
|
||||
doc = nlp(sample_text)
|
||||
doc.ents = [Span(doc, 7, 8, label="test")]
|
||||
html = displacy.render(doc, style="ent")
|
||||
found = html.count("</br>")
|
||||
assert found == 4
|
||||
|
||||
|
||||
def test_issue5918():
|
||||
# Test edge case when merging entities.
|
||||
nlp = English()
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
patterns = [
|
||||
{"label": "ORG", "pattern": "Digicon Inc"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
text = """
|
||||
Digicon Inc said it has completed the previously-announced disposition
|
||||
of its computer systems division to an investment group led by
|
||||
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
|
||||
"""
|
||||
doc = nlp(text)
|
||||
assert len(doc.ents) == 3
|
||||
# make it so that the third span's head is within the entity (ent_iob=I)
|
||||
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
|
||||
# TODO: test for logging here
|
||||
# with pytest.warns(UserWarning):
|
||||
# doc[29].head = doc[33]
|
||||
doc = merge_entities(doc)
|
||||
assert len(doc.ents) == 3
|
|
@ -1,37 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.util import fix_random_seed
|
||||
|
||||
|
||||
def test_issue5551():
|
||||
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
|
||||
component = "textcat"
|
||||
pipe_cfg = {
|
||||
"model": {
|
||||
"@architectures": "spacy.TextCatBOW.v1",
|
||||
"exclusive_classes": True,
|
||||
"ngram_size": 2,
|
||||
"no_output_layer": False,
|
||||
}
|
||||
}
|
||||
|
||||
results = []
|
||||
for i in range(3):
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
example = (
|
||||
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
|
||||
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
|
||||
)
|
||||
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
|
||||
for label in set(example[1]["cats"]):
|
||||
pipe.add_label(label)
|
||||
nlp.initialize()
|
||||
|
||||
# Store the result of each iteration
|
||||
result = pipe.model.predict([nlp.make_doc(example[0])])
|
||||
results.append(list(result[0]))
|
||||
|
||||
# All results should be the same because of the fixed seed
|
||||
assert len(results) == 3
|
||||
assert results[0] == results[1]
|
||||
assert results[0] == results[2]
|
|
@ -1,23 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span
|
||||
from spacy import displacy
|
||||
|
||||
|
||||
SAMPLE_TEXT = """First line
|
||||
Second line, with ent
|
||||
Third line
|
||||
Fourth line
|
||||
"""
|
||||
|
||||
|
||||
def test_issue5838():
|
||||
# Displacy's EntityRenderer break line
|
||||
# not working after last entity
|
||||
|
||||
nlp = English()
|
||||
doc = nlp(SAMPLE_TEXT)
|
||||
doc.ents = [Span(doc, 7, 8, label="test")]
|
||||
|
||||
html = displacy.render(doc, style="ent")
|
||||
found = html.count("</br>")
|
||||
assert found == 4
|
|
@ -1,29 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.pipeline import merge_entities
|
||||
|
||||
|
||||
def test_issue5918():
|
||||
# Test edge case when merging entities.
|
||||
nlp = English()
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
patterns = [
|
||||
{"label": "ORG", "pattern": "Digicon Inc"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
text = """
|
||||
Digicon Inc said it has completed the previously-announced disposition
|
||||
of its computer systems division to an investment group led by
|
||||
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
|
||||
"""
|
||||
doc = nlp(text)
|
||||
assert len(doc.ents) == 3
|
||||
# make it so that the third span's head is within the entity (ent_iob=I)
|
||||
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
|
||||
# TODO: test for logging here
|
||||
# with pytest.warns(UserWarning):
|
||||
# doc[29].head = doc[33]
|
||||
doc = merge_entities(doc)
|
||||
assert len(doc.ents) == 3
|
|
@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
|
|||
docs = [docs]
|
||||
json_doc = {"id": doc_id, "paragraphs": []}
|
||||
for i, doc in enumerate(docs):
|
||||
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
|
||||
raw = None if doc.has_unknown_spaces else doc.text
|
||||
json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []}
|
||||
for cat, val in doc.cats.items():
|
||||
json_cat = {"label": cat, "value": val}
|
||||
json_para["cats"].append(json_cat)
|
||||
|
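The change above makes `docs_to_json` emit `"raw": None` for a `Doc` that was constructed without reliable whitespace information (`doc.has_unknown_spaces`), instead of always writing `doc.text`. A usage sketch, assuming `docs_to_json` is importable from `spacy.training`:

```python
import spacy
from spacy.training import docs_to_json  # import path assumed

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
json_doc = docs_to_json([doc], doc_id=0)
# "raw" is only None for Docs with unknown spacing; here it's the original text
print(json_doc["paragraphs"][0]["raw"])
```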
|
|
@ -1,19 +1,18 @@
|
|||
import { Help } from 'components/typography'; import Link from 'components/link'
|
||||
|
||||
<!-- TODO: update speed and v2 NER numbers -->
|
||||
|
||||
<figure>
|
||||
|
||||
| Pipeline | Parser | Tagger | NER | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br/>GPU <Help>words per second on GPU, higher is better</Help> |
|
||||
| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.7 | 1k | 8k |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.8 | 7k | |
|
||||
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | | 10k | |
|
||||
| Pipeline | Parser | Tagger | NER |
|
||||
| ---------------------------------------------------------- | -----: | -----: | ---: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.4 |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.4 |
|
||||
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
|
||||
|
||||
<figcaption class="caption">
|
||||
|
||||
**Full pipeline accuracy and speed** on the
|
||||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
|
||||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on
|
||||
the development set).
|
||||
|
||||
</figcaption>
|
||||
|
||||
|
@ -22,13 +21,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'
|
|||
<figure>
|
||||
|
||||
| Named Entity Recognition System | OntoNotes | CoNLL '03 |
|
||||
| ------------------------------------------------------------------------------ | --------: | --------: |
|
||||
| -------------------------------- | --------: | --------: |
|
||||
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
|
||||
| spaCy CNN (2020) | 84.5 | |
|
||||
| spaCy CNN (2017) | | |
|
||||
| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
|
||||
| <Link to="https://github.com/flairNLP/flair" hideIcon>Flair</Link><sup>2</sup> | 89.7 | 93.1 |
|
||||
| BERT Base<sup>3</sup> | - | 92.4 |
|
||||
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
|
||||
| Flair<sup>2</sup> | 89.7 | 93.1 |
|
||||
|
||||
<figcaption class="caption">
|
||||
|
||||
|
@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'
|
|||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and
|
||||
[CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See
|
||||
[NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for
|
||||
more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf).
|
||||
**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3.
|
||||
** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805).
|
||||
more results. Project template:
|
||||
[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. **
|
||||
[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. **
|
||||
[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).
|
||||
|
||||
</figcaption>
|
||||
|
||||
|
|
|
@ -10,6 +10,18 @@ menu:
|
|||
|
||||
## Comparison {#comparison hidden="true"}
|
||||
|
||||
spaCy is a **free, open-source library** for advanced **Natural Language
|
||||
Processing** (NLP) in Python. It's designed specifically for **production use**
|
||||
and helps you build applications that process and "understand" large volumes of
|
||||
text. It can be used to build information extraction or natural language
|
||||
understanding systems.
|
||||
|
||||
### Feature overview {#comparison-features}
|
||||
|
||||
import Features from 'widgets/features.js'
|
||||
|
||||
<Features />
|
||||
|
||||
### When should I use spaCy? {#comparison-usage}
|
||||
|
||||
- ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy
|
||||
|
@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
| Dependency Parsing System | UAS | LAS |
|
||||
| ------------------------------------------------------------------------------ | ---: | ---: |
|
||||
| spaCy RoBERTa (2020)<sup>1</sup> | 95.5 | 94.3 |
|
||||
| spaCy CNN (2020)<sup>1</sup> | | |
|
||||
| spaCy RoBERTa (2020) | 95.5 | 94.3 |
|
||||
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
|
||||
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
|
||||
|
||||
|
@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
**Dependency parsing accuracy** on the Penn Treebank. See
|
||||
[NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
|
||||
results. **1. ** Project template:
|
||||
results. Project template:
|
||||
[`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).
|
||||
|
||||
</figcaption>
|
||||
|
|
|
@ -490,7 +490,7 @@ phrases, so that you can resolve overlaps and other conflicts in whatever way
|
|||
you prefer.
|
||||
|
||||
| Argument | Description |
|
||||
| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `matcher` | The matcher instance. ~~Matcher~~ |
|
||||
| `doc` | The document the matcher was used on. ~~Doc~~ |
|
||||
| `i` | Index of the current match (`matches[i]`). ~~int~~ |
|
||||
|
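The table documents the arguments passed to an `on_match` callback. In code, the expected signature looks roughly like this (the pattern and names below are illustrative):

```python
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print("Matched span:", doc[start:end].text)

# hypothetical pattern; the callback receives the arguments listed in the table above
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=on_match)
matches = matcher(nlp("Hello world!"))
```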
@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
|
|||
containing a match and render them with the
|
||||
[displaCy visualizer](/usage/visualizers). In the callback function, you'll have
|
||||
access to the `start` and `end` of each match, as well as the parent `Doc`. This
|
||||
lets you determine the sentence containing the match, `doc[start:end].sent`,
|
||||
and calculate the start and end of the matched span within the sentence. Using
|
||||
lets you determine the sentence containing the match, `doc[start:end].sent`, and
|
||||
calculate the start and end of the matched span within the sentence. Using
|
||||
displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
|
||||
list of dictionaries containing the text and entities to render.
|
||||
|
||||
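A sketch of the callback pattern this passage describes, collecting each matched sentence in the "manual" format that displaCy expects (the `collect_sents` name and `"MATCH"` label are illustrative):

```python
matched_sents = []  # dicts for displacy.render(matched_sents, style="ent", manual=True)

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # the matched span
    sent = span.sent       # sentence containing the match (requires sentence boundaries)
    # start/end of the match relative to the sentence, for manual rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})
```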
|
|
|
@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
<Benchmarks />
|
||||
|
||||
#### New trained transformer-based pipelines {#features-transformers-pipelines}
|
||||
|
||||
> #### Notes on model capabilities
|
||||
>
|
||||
> The models are each trained with a **single transformer** shared across the
|
||||
> pipeline, which requires it to be trained on a single corpus. For
|
||||
> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
|
||||
> corpus, which has annotations across several tasks. For [French](/models/fr),
|
||||
> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
|
||||
> corpus that had both syntactic and entity annotations, so the transformer
|
||||
> models for those languages do not include NER.
|
||||
|
||||
| Package | Language | Transformer | Tagger | Parser | NER |
|
||||
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.0 | 89.4 |
|
||||
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - |
|
||||
| [`es_dep_news_trf`](/models/es#es_dep_news_trf) | Spanish | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 98.2 | 94.6 | - |
|
||||
| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf) | French | [`camembert-base`](https://huggingface.co/camembert-base) | 95.7 | 94.9 | - |
|
||||
| [`zh_core_web_trf`](/models/zh#zh_core_news_trf) | Chinese | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 92.5 | 77.2 | 75.6 |
|
||||
|
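The table lists the new transformer-based pipelines and their per-component accuracy. Loading one works like any other trained pipeline, provided `spacy-transformers` and the chosen package are installed:

```python
import spacy

# any package from the table above, e.g. the English transformer pipeline
nlp = spacy.load("en_core_web_trf")
doc = nlp("spaCy v3 ships trained transformer-based pipelines.")
print([(t.text, t.tag_, t.dep_) for t in doc])
print(doc.ents)
```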
||||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||
|
||||
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
|
||||
|
@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
|
||||
[TransformerListener](/api/architectures#TransformerListener),
|
||||
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
|
||||
- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf),
|
||||
[`de_dep_news_trf`](/models/de#de_dep_news_trf),
|
||||
[`es_dep_news_trf`](/models/es#es_dep_news_trf),
|
||||
[`fr_dep_news_trf`](/models/fr#fr_dep_news_trf),
|
||||
[`zh_core_web_trf`](/models/zh#zh_core_web_trf)
|
||||
- **Implementation:**
|
||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
|
||||
|
||||
|
|
website/src/widgets/features.js (new file, 72 lines)
|
@ -0,0 +1,72 @@
|
|||
import React from 'react'
|
||||
import { graphql, StaticQuery } from 'gatsby'
|
||||
|
||||
import { Ul, Li } from '../components/list'
|
||||
|
||||
export default () => (
|
||||
<StaticQuery
|
||||
query={query}
|
||||
render={({ site }) => {
|
||||
const { counts } = site.siteMetadata
|
||||
return (
|
||||
<Ul>
|
||||
<Li>
|
||||
✅ Support for <strong>{counts.langs}+ languages</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ <strong>{counts.models} trained pipelines</strong> for{' '}
|
||||
{counts.modelLangs} languages
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Multi-task learning with pretrained <strong>transformers</strong> like
|
||||
BERT
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Pretrained <strong>word vectors</strong>
|
||||
</Li>
|
||||
<Li>✅ State-of-the-art speed</Li>
|
||||
<Li>
|
||||
✅ Production-ready <strong>training system</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Linguistically-motivated <strong>tokenization</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Components for <strong>named entity</strong> recognition, part-of-speech
|
||||
tagging, dependency parsing, sentence segmentation,{' '}
|
||||
<strong>text classification</strong>, lemmatization, morphological analysis,
|
||||
entity linking and more
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easily extensible with <strong>custom components</strong> and attributes
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Support for custom models in <strong>PyTorch</strong>,{' '}
|
||||
<strong>TensorFlow</strong> and other frameworks
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Built in <strong>visualizers</strong> for syntax and NER
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easy <strong>model packaging</strong>, deployment and workflow management
|
||||
</Li>
|
||||
<Li>✅ Robust, rigorously evaluated accuracy</Li>
|
||||
</Ul>
|
||||
)
|
||||
}}
|
||||
/>
|
||||
)
|
||||
|
||||
const query = graphql`
|
||||
query FeaturesQuery {
|
||||
site {
|
||||
siteMetadata {
|
||||
counts {
|
||||
langs
|
||||
modelLangs
|
||||
models
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
`
|
|
@ -14,13 +14,13 @@ import {
|
|||
LandingBanner,
|
||||
} from '../components/landing'
|
||||
import { H2 } from '../components/typography'
|
||||
import { Ul, Li } from '../components/list'
|
||||
import { InlineCode } from '../components/code'
|
||||
import Button from '../components/button'
|
||||
import Link from '../components/link'
|
||||
|
||||
import QuickstartTraining from './quickstart-training'
|
||||
import Project from './project'
|
||||
import Features from './features'
|
||||
import courseImage from '../../docs/images/course.jpg'
|
||||
import prodigyImage from '../../docs/images/prodigy_overview.jpg'
|
||||
import projectsImage from '../../docs/images/projects.png'
|
||||
|
@ -56,7 +56,7 @@ for entity in doc.ents:
|
|||
}
|
||||
|
||||
const Landing = ({ data }) => {
|
||||
const { counts, nightly } = data
|
||||
const { nightly } = data
|
||||
const codeExample = getCodeExample(nightly)
|
||||
return (
|
||||
<>
|
||||
|
@ -98,51 +98,7 @@ const Landing = ({ data }) => {
|
|||
|
||||
<LandingCol>
|
||||
<H2>Features</H2>
|
||||
<Ul>
|
||||
<Li>
|
||||
✅ Support for <strong>{counts.langs}+ languages</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ <strong>{counts.models} trained pipelines</strong> for{' '}
|
||||
{counts.modelLangs} languages
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Multi-task learning with pretrained <strong>transformers</strong>{' '}
|
||||
like BERT
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Pretrained <strong>word vectors</strong>
|
||||
</Li>
|
||||
<Li>✅ State-of-the-art speed</Li>
|
||||
<Li>
|
||||
✅ Production-ready <strong>training system</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Linguistically-motivated <strong>tokenization</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Components for <strong>named entity</strong> recognition,
|
||||
part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
|
||||
<strong>text classification</strong>, lemmatization, morphological
|
||||
analysis, entity linking and more
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easily extensible with <strong>custom components</strong> and
|
||||
attributes
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Support for custom models in <strong>PyTorch</strong>,{' '}
|
||||
<strong>TensorFlow</strong> and other frameworks
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Built in <strong>visualizers</strong> for syntax and NER
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easy <strong>model packaging</strong>, deployment and workflow
|
||||
management
|
||||
</Li>
|
||||
<Li>✅ Robust, rigorously evaluated accuracy</Li>
|
||||
</Ul>
|
||||
<Features />
|
||||
</LandingCol>
|
||||
</LandingGrid>
|
||||
|
||||
|
@ -333,11 +289,6 @@ const landingQuery = graphql`
|
|||
siteMetadata {
|
||||
nightly
|
||||
repo
|
||||
counts {
|
||||
langs
|
||||
modelLangs
|
||||
models
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|