diff --git a/.github/contributors/Nuccy90.md b/.github/contributors/Nuccy90.md new file mode 100644 index 000000000..2d1adb825 --- /dev/null +++ b/.github/contributors/Nuccy90.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. 
The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. + +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Elena Fano | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 2020-09-21 | +| GitHub username | Nuccy90 | +| Website (optional) | | diff --git a/.github/contributors/rahul1990gupta.md b/.github/contributors/rahul1990gupta.md new file mode 100644 index 000000000..eab41b3b1 --- /dev/null +++ b/.github/contributors/rahul1990gupta.md @@ -0,0 +1,106 @@ +# spaCy contributor agreement + +This spaCy Contributor Agreement (**"SCA"**) is based on the +[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). +The SCA applies to any contribution that you make to any product or project +managed by us (the **"project"**), and sets out the intellectual property rights +you grant to us in the contributed materials. The term **"us"** shall mean +[ExplosionAI GmbH](https://explosion.ai/legal). The term +**"you"** shall mean the person or entity identified below. + +If you agree to be bound by these terms, fill in the information requested +below and include the filled-in version with your first pull request, under the +folder [`.github/contributors/`](/.github/contributors/). The name of the file +should be your GitHub username, with the extension `.md`. For example, the user +example_user would create the file `.github/contributors/example_user.md`. + +Read this agreement carefully before signing. These terms and conditions +constitute a binding legal agreement. + +## Contributor Agreement + +1. The term "contribution" or "contributed materials" means any source code, +object code, patch, tool, sample, graphic, specification, manual, +documentation, or any other material posted or submitted by you to the project. + +2. 
With respect to any worldwide copyrights, or copyright applications and +registrations, in your contribution: + + * you hereby assign to us joint ownership, and to the extent that such + assignment is or becomes invalid, ineffective or unenforceable, you hereby + grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, + royalty-free, unrestricted license to exercise all rights under those + copyrights. This includes, at our option, the right to sublicense these same + rights to third parties through multiple levels of sublicensees or other + licensing arrangements; + + * you agree that each of us can do all things in relation to your + contribution as if each of us were the sole owners, and if one of us makes + a derivative work of your contribution, the one who makes the derivative + work (or has it made will be the sole owner of that derivative work; + + * you agree that you will not assert any moral rights in your contribution + against us, our licensees or transferees; + + * you agree that we may register a copyright in your contribution and + exercise all ownership rights associated with it; and + + * you agree that neither of us has any duty to consult with, obtain the + consent of, pay or render an accounting to the other for any use or + distribution of your contribution. + +3. With respect to any patents you own, or that you can license without payment +to any third party, you hereby grant to us a perpetual, irrevocable, +non-exclusive, worldwide, no-charge, royalty-free license to: + + * make, have made, use, sell, offer to sell, import, and otherwise transfer + your contribution in whole or in part, alone or in combination with or + included in any product, work or materials arising out of the project to + which your contribution was submitted, and + + * at our option, to sublicense these same rights to third parties through + multiple levels of sublicensees or other licensing arrangements. + +4. Except as set out above, you keep all right, title, and interest in your +contribution. The rights that you grant to us under these terms are effective +on the date you first submitted a contribution to us, even if your submission +took place before the date you sign these terms. + +5. You covenant, represent, warrant and agree that: + + * Each contribution that you submit is and shall be an original work of + authorship and you can legally grant the rights set out in this SCA; + + * to the best of your knowledge, each contribution will not violate any + third party's copyrights, trademarks, patents, or other intellectual + property rights; and + + * each contribution shall be in compliance with U.S. export control laws and + other applicable export and import laws. You agree to notify us if you + become aware of any circumstance which would make any of the foregoing + representations inaccurate in any respect. We may publicly disclose your + participation in the project, including the fact that you have signed the SCA. + +6. This SCA is governed by the laws of the State of California and applicable +U.S. Federal law. Any choice of law rules will not apply. + +7. Please place an “x” on one of the applicable statement below. Please do NOT +mark both statements: + + * [x] I am signing on behalf of myself as an individual and no other person + or entity, including my employer, has or will have rights with respect to my + contributions. + + * [ ] I am signing on behalf of my employer or a legal entity and I have the + actual authority to contractually bind that entity. 
+ +## Contributor Details + +| Field | Entry | +|------------------------------- | -------------------- | +| Name | Rahul Gupta | +| Company name (if applicable) | | +| Title or role (if applicable) | | +| Date | 28 July 2020 | +| GitHub username | rahul1990gupta | +| Website (optional) | | diff --git a/spacy/about.py b/spacy/about.py index 9c5dd0b4f..bf1d53a7b 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy-nightly" -__version__ = "3.0.0a41" +__version__ = "3.0.0rc1" __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __projects__ = "https://github.com/explosion/projects" diff --git a/spacy/lang/hi/lex_attrs.py b/spacy/lang/hi/lex_attrs.py index 20a8c2975..a18c2e513 100644 --- a/spacy/lang/hi/lex_attrs.py +++ b/spacy/lang/hi/lex_attrs.py @@ -10,23 +10,26 @@ _stem_suffixes = [ ["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"], ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"] ] -# fmt: on -# reference 1:https://en.wikipedia.org/wiki/Indian_numbering_system +# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system # reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/ +# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/ -_num_words = [ +_one_to_ten = [ "शून्य", "एक", "दो", "तीन", "चार", - "पांच", + "पांच", "पाँच", "छह", "सात", "आठ", "नौ", "दस", +] + +_eleven_to_beyond = [ "ग्यारह", "बारह", "तेरह", @@ -37,13 +40,85 @@ _num_words = [ "अठारह", "उन्नीस", "बीस", + "इकीस", "इक्कीस", + "बाईस", + "तेइस", + "चौबीस", + "पच्चीस", + "छब्बीस", + "सताइस", "सत्ताइस", + "अट्ठाइस", + "उनतीस", "तीस", + "इकतीस", "इकत्तीस", + "बतीस", "बत्तीस", + "तैंतीस", + "चौंतीस", + "पैंतीस", + "छतीस", "छत्तीस", + "सैंतीस", + "अड़तीस", + "उनतालीस", "उनत्तीस", "चालीस", + "इकतालीस", + "बयालीस", + "तैतालीस", + "चवालीस", + "पैंतालीस", + "छयालिस", + "सैंतालीस", + "अड़तालीस", + "उनचास", "पचास", + "इक्यावन", + "बावन", + "तिरपन", "तिरेपन", + "चौवन", "चउवन", + "पचपन", + "छप्पन", + "सतावन", "सत्तावन", + "अठावन", + "उनसठ", "साठ", + "इकसठ", + "बासठ", + "तिरसठ", "तिरेसठ", + "चौंसठ", + "पैंसठ", + "छियासठ", + "सड़सठ", + "अड़सठ", + "उनहत्तर", "सत्तर", + "इकहत्तर" + "बहत्तर", + "तिहत्तर", + "चौहत्तर", + "पचहत्तर", + "छिहत्तर", + "सतहत्तर", + "अठहत्तर", + "उन्नासी", "उन्यासी" "अस्सी", + "इक्यासी", + "बयासी", + "तिरासी", + "चौरासी", + "पचासी", + "छियासी", + "सतासी", + "अट्ठासी", + "नवासी", "नब्बे", + "इक्यानवे", + "बानवे", + "तिरानवे", + "चौरानवे", + "पचानवे", + "छियानवे", + "सतानवे", + "अट्ठानवे", + "निन्यानवे", "सौ", "हज़ार", "लाख", @@ -52,6 +127,23 @@ _num_words = [ "खरब", ] +_num_words = _one_to_ten + _eleven_to_beyond + +_ordinal_words_one_to_ten = [ + "प्रथम", "पहला", + "द्वितीय", "दूसरा", + "तृतीय", "तीसरा", + "चौथा", + "पांचवाँ", + "छठा", + "सातवाँ", + "आठवाँ", + "नौवाँ", + "दसवाँ", +] +_ordinal_suffix = "वाँ" +# fmt: on + def norm(string): # normalise base exceptions, e.g. 
punctuation or currency symbols @@ -64,7 +156,7 @@ def norm(string): for suffix_group in reversed(_stem_suffixes): length = len(suffix_group[0]) if len(string) <= length: - break + continue for suffix in suffix_group: if string.endswith(suffix): return string[:-length] @@ -74,7 +166,7 @@ def norm(string): def like_num(text): if text.startswith(("+", "-", "±", "~")): text = text[1:] - text = text.replace(", ", "").replace(".", "") + text = text.replace(",", "").replace(".", "") if text.isdigit(): return True if text.count("/") == 1: @@ -83,6 +175,14 @@ def like_num(text): return True if text.lower() in _num_words: return True + + # check ordinal numbers + # reference: http://www.englishkitab.com/Vocabulary/Numbers.html + if text in _ordinal_words_one_to_ten: + return True + if text.endswith(_ordinal_suffix): + if text[: -len(_ordinal_suffix)] in _eleven_to_beyond: + return True return False diff --git a/spacy/lang/ta/examples.py b/spacy/lang/ta/examples.py index c3c47e66e..e68dc6237 100644 --- a/spacy/lang/ta/examples.py +++ b/spacy/lang/ta/examples.py @@ -19,4 +19,6 @@ sentences = [ "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன", "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது", "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.", + "என்ன வேலை செய்கிறீர்கள்?", + "எந்த கல்லூரியில் படிக்கிறாய்?", ] diff --git a/spacy/lang/tr/lex_attrs.py b/spacy/lang/tr/lex_attrs.py index d9e12c4aa..f7416837d 100644 --- a/spacy/lang/tr/lex_attrs.py +++ b/spacy/lang/tr/lex_attrs.py @@ -73,20 +73,16 @@ def like_num(text): num, denom = text.split("/") if num.isdigit() and denom.isdigit(): return True - text_lower = text.lower() - # Check cardinal number if text_lower in _num_words: return True - # Check ordinal number if text_lower in _ordinal_words: return True if text_lower.endswith(_ordinal_endings): if text_lower[:-3].isdigit() or text_lower[:-4].isdigit(): return True - return False diff --git a/spacy/lang/tr/syntax_iterators.py b/spacy/lang/tr/syntax_iterators.py index d9b342949..3fd726fb5 100644 --- a/spacy/lang/tr/syntax_iterators.py +++ b/spacy/lang/tr/syntax_iterators.py @@ -1,6 +1,3 @@ -# coding: utf8 -from __future__ import unicode_literals - from ...symbols import NOUN, PROPN, PRON from ...errors import Errors diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 3b0de899b..3733d345d 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -125,6 +125,11 @@ def he_tokenizer(): return get_lang_class("he")().tokenizer +@pytest.fixture(scope="session") +def hi_tokenizer(): + return get_lang_class("hi")().tokenizer + + @pytest.fixture(scope="session") def hr_tokenizer(): return get_lang_class("hr")().tokenizer @@ -240,11 +245,6 @@ def tr_tokenizer(): return get_lang_class("tr")().tokenizer -@pytest.fixture(scope="session") -def tr_vocab(): - return get_lang_class("tr").Defaults.create_vocab() - - @pytest.fixture(scope="session") def tt_tokenizer(): return get_lang_class("tt")().tokenizer @@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg(): "segmenter": "pkuseg", } }, - "initialize": { - "tokenizer": { - "pkuseg_model": "web", - } - }, + "initialize": {"tokenizer": {"pkuseg_model": "web"}}, } nlp = get_lang_class("zh").from_config(config) nlp.initialize() diff --git a/spacy/tests/lang/hi/__init__.py b/spacy/tests/lang/hi/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/spacy/tests/lang/hi/test_lex_attrs.py b/spacy/tests/lang/hi/test_lex_attrs.py new file mode 100644 index 000000000..80a7cc1c4 --- /dev/null +++ 
b/spacy/tests/lang/hi/test_lex_attrs.py @@ -0,0 +1,43 @@ +import pytest +from spacy.lang.hi.lex_attrs import norm, like_num + + +def test_hi_tokenizer_handles_long_text(hi_tokenizer): + text = """ +ये कहानी 1900 के दशक की है। कौशल्या (स्मिता जयकर) को पता चलता है कि उसका +छोटा बेटा, देवदास (शाहरुख खान) वापस घर आ रहा है। देवदास 10 साल पहले कानून की +पढ़ाई करने के लिए इंग्लैंड गया था। उसके लौटने की खुशी में ये बात कौशल्या अपनी पड़ोस +में रहने वाली सुमित्रा (किरण खेर) को भी बता देती है। इस खबर से वो भी खुश हो जाती है। +""" + tokens = hi_tokenizer(text) + assert len(tokens) == 86 + + +@pytest.mark.parametrize( + "word,word_norm", + [ + ("चलता", "चल"), + ("पढ़ाई", "पढ़"), + ("देती", "दे"), + ("जाती", "ज"), + ("मुस्कुराकर", "मुस्कुर"), + ], +) +def test_hi_norm(word, word_norm): + assert norm(word) == word_norm + + +@pytest.mark.parametrize( + "word", + ["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"], +) +def test_hi_like_num(word): + assert like_num(word) + + +@pytest.mark.parametrize( + "word", + ["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"], +) +def test_hi_like_num_ordinal_words(word): + assert like_num(word) diff --git a/spacy/tests/parser/test_ner.py b/spacy/tests/parser/test_ner.py index b657ae2e8..b4c22b48d 100644 --- a/spacy/tests/parser/test_ner.py +++ b/spacy/tests/parser/test_ner.py @@ -1,4 +1,7 @@ import pytest +from numpy.testing import assert_equal +from spacy.attrs import ENT_IOB + from spacy import util from spacy.lang.en import English from spacy.language import Language @@ -332,6 +335,19 @@ def test_overfitting_IO(): assert ents2[0].text == "London" assert ents2[0].label_ == "LOC" + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + def test_ner_warns_no_lookups(caplog): nlp = English() diff --git a/spacy/tests/parser/test_parse.py b/spacy/tests/parser/test_parse.py index ffb6f23f1..a914eb17a 100644 --- a/spacy/tests/parser/test_parse.py +++ b/spacy/tests/parser/test_parse.py @@ -1,4 +1,7 @@ import pytest +from numpy.testing import assert_equal +from spacy.attrs import DEP + from spacy.lang.en import English from spacy.training import Example from spacy.tokens import Doc @@ -210,3 +213,16 @@ def test_overfitting_IO(): assert doc2[0].dep_ == "nsubj" assert doc2[2].dep_ == "dobj" assert doc2[3].dep_ == "punct" + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py index f2e6defcb..8ba2d0d3e 100644 --- a/spacy/tests/pipeline/test_entity_linker.py +++ b/spacy/tests/pipeline/test_entity_linker.py @@ -1,5 
+1,7 @@ from typing import Callable, Iterable import pytest +from numpy.testing import assert_equal +from spacy.attrs import ENT_KB_ID from spacy.kb import KnowledgeBase, get_candidates, Candidate from spacy.vocab import Vocab @@ -496,6 +498,19 @@ def test_overfitting_IO(): predictions.append(ent.kb_id_) assert predictions == GOLD_entities + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Russ Cochran captured his first major title with his son as caddie.", + "Russ Cochran his reprints include EC Comics.", + "Russ Cochran has been publishing comic art.", + "Russ Cochran was a member of University of Kentucky's golf team.", + ] + batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + def test_kb_serialization(): # Test that the KB can be used in a pipeline with a different vocab diff --git a/spacy/tests/pipeline/test_models.py b/spacy/tests/pipeline/test_models.py new file mode 100644 index 000000000..d04ac9cd4 --- /dev/null +++ b/spacy/tests/pipeline/test_models.py @@ -0,0 +1,107 @@ +from typing import List + +import numpy +import pytest +from numpy.testing import assert_almost_equal +from spacy.vocab import Vocab +from thinc.api import NumpyOps, Model, data_validation +from thinc.types import Array2d, Ragged + +from spacy.lang.en import English +from spacy.ml import FeatureExtractor, StaticVectors +from spacy.ml._character_embed import CharacterEmbed +from spacy.tokens import Doc + + +OPS = NumpyOps() + +texts = ["These are 4 words", "Here just three"] +l0 = [[1, 2], [3, 4], [5, 6], [7, 8]] +l1 = [[9, 8], [7, 6], [5, 4]] +list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")] +list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")] +array = OPS.xp.asarray(l1, dtype="f") +ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i")) + + +def get_docs(): + vocab = Vocab() + for t in texts: + for word in t.split(): + hash_id = vocab.strings.add(word) + vector = numpy.random.uniform(-1, 1, (7,)) + vocab.set_vector(hash_id, vector) + docs = [English(vocab)(t) for t in texts] + return docs + + +# Test components with a model of type Model[List[Doc], List[Floats2d]] +@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"]) +def test_components_batching_list(name): + nlp = English() + proc = nlp.create_pipe(name) + util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats) + + +# Test components with a model of type Model[List[Doc], Floats2d] +@pytest.mark.parametrize("name", ["textcat"]) +def test_components_batching_array(name): + nlp = English() + proc = nlp.create_pipe(name) + util_batch_unbatch_docs_array(proc.model, get_docs(), array) + + +LAYERS = [ + (CharacterEmbed(nM=5, nC=3), get_docs(), list_floats), + (FeatureExtractor([100, 200]), get_docs(), list_ints), + (StaticVectors(), get_docs(), ragged), +] + + +@pytest.mark.parametrize("model,in_data,out_data", LAYERS) +def test_layers_batching_all(model, in_data, out_data): + # In = List[Doc] + if isinstance(in_data, list) and isinstance(in_data[0], Doc): + if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2: + util_batch_unbatch_docs_array(model, in_data, out_data) + elif ( + isinstance(out_data, list) + and isinstance(out_data[0], 
OPS.xp.ndarray) + and out_data[0].ndim == 2 + ): + util_batch_unbatch_docs_list(model, in_data, out_data) + elif isinstance(out_data, Ragged): + util_batch_unbatch_docs_ragged(model, in_data, out_data) + + +def util_batch_unbatch_docs_list( + model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d] +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data) + Y_not_batched = [model.predict([u])[0] for u in in_data] + for i in range(len(Y_batched)): + assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4) + + +def util_batch_unbatch_docs_array( + model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data).tolist() + Y_not_batched = [model.predict([u])[0] for u in in_data] + assert_almost_equal(Y_batched, Y_not_batched, decimal=4) + + +def util_batch_unbatch_docs_ragged( + model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged +): + with data_validation(True): + model.initialize(in_data, out_data) + Y_batched = model.predict(in_data) + Y_not_batched = [] + for u in in_data: + Y_not_batched.extend(model.predict([u]).data.tolist()) + assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4) diff --git a/spacy/tests/pipeline/test_morphologizer.py b/spacy/tests/pipeline/test_morphologizer.py index fd7aa05be..85d1d6c8b 100644 --- a/spacy/tests/pipeline/test_morphologizer.py +++ b/spacy/tests/pipeline/test_morphologizer.py @@ -1,4 +1,5 @@ import pytest +from numpy.testing import assert_equal from spacy import util from spacy.training import Example @@ -6,6 +7,7 @@ from spacy.lang.en import English from spacy.language import Language from spacy.tests.util import make_tempdir from spacy.morphology import Morphology +from spacy.attrs import MORPH def test_label_types(): @@ -101,3 +103,16 @@ def test_overfitting_IO(): doc2 = nlp2(test_text) assert [str(t.morph) for t in doc2] == gold_morphs assert [t.pos_ for t in doc2] == gold_pos_tags + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/pipeline/test_senter.py b/spacy/tests/pipeline/test_senter.py index c9722e5de..7a256f79b 100644 --- a/spacy/tests/pipeline/test_senter.py +++ b/spacy/tests/pipeline/test_senter.py @@ -1,4 +1,6 @@ import pytest +from numpy.testing import assert_equal +from spacy.attrs import SENT_START from spacy import util from spacy.training import Example @@ -80,3 +82,18 @@ def test_overfitting_IO(): nlp2 = util.load_model_from_path(tmp_dir) doc2 = nlp2(test_text) assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts + + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "Then one more sentence about London.", + "Here is another one.", + "I like London.", + ] + batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)] + no_batch_deps = [ + 
doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts] + ] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index b9db76cdf..885bdbce1 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -1,4 +1,7 @@ import pytest +from numpy.testing import assert_equal +from spacy.attrs import TAG + from spacy import util from spacy.training import Example from spacy.lang.en import English @@ -117,6 +120,19 @@ def test_overfitting_IO(): assert doc2[2].tag_ is "J" assert doc2[3].tag_ is "N" + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = [ + "Just a sentence.", + "I like green eggs.", + "Here is another one.", + "I eat ham.", + ] + batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)] + no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + def test_tagger_requires_labels(): nlp = English() diff --git a/spacy/tests/pipeline/test_textcat.py b/spacy/tests/pipeline/test_textcat.py index dd2f1070b..91348b1b3 100644 --- a/spacy/tests/pipeline/test_textcat.py +++ b/spacy/tests/pipeline/test_textcat.py @@ -1,6 +1,7 @@ import pytest import random import numpy.random +from numpy.testing import assert_equal from thinc.api import fix_random_seed from spacy import util from spacy.lang.en import English @@ -174,6 +175,14 @@ def test_overfitting_IO(): assert scores["cats_score"] == 1.0 assert "cats_score_desc" in scores + # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions + texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."] + batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)] + batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)] + no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]] + assert_equal(batch_deps_1, batch_deps_2) + assert_equal(batch_deps_1, no_batch_deps) + # fmt: off @pytest.mark.parametrize( diff --git a/spacy/tests/regression/test_issue5501-6000.py b/spacy/tests/regression/test_issue5501-6000.py new file mode 100644 index 000000000..f0b46cb83 --- /dev/null +++ b/spacy/tests/regression/test_issue5501-6000.py @@ -0,0 +1,76 @@ +from thinc.api import fix_random_seed +from spacy.lang.en import English +from spacy.tokens import Span +from spacy import displacy +from spacy.pipeline import merge_entities + + +def test_issue5551(): + """Test that after fixing the random seed, the results of the pipeline are truly identical""" + component = "textcat" + pipe_cfg = { + "model": { + "@architectures": "spacy.TextCatBOW.v1", + "exclusive_classes": True, + "ngram_size": 2, + "no_output_layer": False, + } + } + results = [] + for i in range(3): + fix_random_seed(0) + nlp = English() + example = ( + "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.", + {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}}, + ) + pipe = nlp.add_pipe(component, config=pipe_cfg, last=True) + for label in set(example[1]["cats"]): + pipe.add_label(label) + nlp.initialize() + # Store the result of each iteration + result = pipe.model.predict([nlp.make_doc(example[0])]) + results.append(list(result[0])) + # All results should be the same because of the fixed seed + assert len(results) == 3 + 
assert results[0] == results[1] + assert results[0] == results[2] + + +def test_issue5838(): + # Displacy's EntityRenderer break line + # not working after last entity + sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n" + nlp = English() + doc = nlp(sample_text) + doc.ents = [Span(doc, 7, 8, label="test")] + html = displacy.render(doc, style="ent") + found = html.count("</br>
") + assert found == 4 + + +def test_issue5918(): + # Test edge case when merging entities. + nlp = English() + ruler = nlp.add_pipe("entity_ruler") + patterns = [ + {"label": "ORG", "pattern": "Digicon Inc"}, + {"label": "ORG", "pattern": "Rotan Mosle Inc's"}, + {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"}, + ] + ruler.add_patterns(patterns) + + text = """ + Digicon Inc said it has completed the previously-announced disposition + of its computer systems division to an investment group led by + Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate. + """ + doc = nlp(text) + assert len(doc.ents) == 3 + # make it so that the third span's head is within the entity (ent_iob=I) + # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents. + # TODO: test for logging here + # with pytest.warns(UserWarning): + # doc[29].head = doc[33] + doc = merge_entities(doc) + assert len(doc.ents) == 3 diff --git a/spacy/tests/regression/test_issue5551.py b/spacy/tests/regression/test_issue5551.py deleted file mode 100644 index 655764362..000000000 --- a/spacy/tests/regression/test_issue5551.py +++ /dev/null @@ -1,37 +0,0 @@ -from spacy.lang.en import English -from spacy.util import fix_random_seed - - -def test_issue5551(): - """Test that after fixing the random seed, the results of the pipeline are truly identical""" - component = "textcat" - pipe_cfg = { - "model": { - "@architectures": "spacy.TextCatBOW.v1", - "exclusive_classes": True, - "ngram_size": 2, - "no_output_layer": False, - } - } - - results = [] - for i in range(3): - fix_random_seed(0) - nlp = English() - example = ( - "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.", - {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}}, - ) - pipe = nlp.add_pipe(component, config=pipe_cfg, last=True) - for label in set(example[1]["cats"]): - pipe.add_label(label) - nlp.initialize() - - # Store the result of each iteration - result = pipe.model.predict([nlp.make_doc(example[0])]) - results.append(list(result[0])) - - # All results should be the same because of the fixed seed - assert len(results) == 3 - assert results[0] == results[1] - assert results[0] == results[2] diff --git a/spacy/tests/regression/test_issue5838.py b/spacy/tests/regression/test_issue5838.py deleted file mode 100644 index 4e4d98beb..000000000 --- a/spacy/tests/regression/test_issue5838.py +++ /dev/null @@ -1,23 +0,0 @@ -from spacy.lang.en import English -from spacy.tokens import Span -from spacy import displacy - - -SAMPLE_TEXT = """First line -Second line, with ent -Third line -Fourth line -""" - - -def test_issue5838(): - # Displacy's EntityRenderer break line - # not working after last entity - - nlp = English() - doc = nlp(SAMPLE_TEXT) - doc.ents = [Span(doc, 7, 8, label="test")] - - html = displacy.render(doc, style="ent") - found = html.count("
") - assert found == 4 diff --git a/spacy/tests/regression/test_issue5918.py b/spacy/tests/regression/test_issue5918.py deleted file mode 100644 index d25323ef6..000000000 --- a/spacy/tests/regression/test_issue5918.py +++ /dev/null @@ -1,29 +0,0 @@ -from spacy.lang.en import English -from spacy.pipeline import merge_entities - - -def test_issue5918(): - # Test edge case when merging entities. - nlp = English() - ruler = nlp.add_pipe("entity_ruler") - patterns = [ - {"label": "ORG", "pattern": "Digicon Inc"}, - {"label": "ORG", "pattern": "Rotan Mosle Inc's"}, - {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"}, - ] - ruler.add_patterns(patterns) - - text = """ - Digicon Inc said it has completed the previously-announced disposition - of its computer systems division to an investment group led by - Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate. - """ - doc = nlp(text) - assert len(doc.ents) == 3 - # make it so that the third span's head is within the entity (ent_iob=I) - # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents. - # TODO: test for logging here - # with pytest.warns(UserWarning): - # doc[29].head = doc[33] - doc = merge_entities(doc) - assert len(doc.ents) == 3 diff --git a/spacy/tests/regression/test_issue5230.py b/spacy/tests/serialize/test_resource_warning.py similarity index 100% rename from spacy/tests/regression/test_issue5230.py rename to spacy/tests/serialize/test_resource_warning.py diff --git a/spacy/training/gold_io.pyx b/spacy/training/gold_io.pyx index 8fb6b8565..327748d01 100644 --- a/spacy/training/gold_io.pyx +++ b/spacy/training/gold_io.pyx @@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"): docs = [docs] json_doc = {"id": doc_id, "paragraphs": []} for i, doc in enumerate(docs): - json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []} + raw = None if doc.has_unknown_spaces else doc.text + json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []} for cat, val in doc.cats.items(): json_cat = {"label": cat, "value": val} json_para["cats"].append(json_cat) diff --git a/spacy/training/loop.py b/spacy/training/loop.py index c3fa83b39..eecb3e273 100644 --- a/spacy/training/loop.py +++ b/spacy/training/loop.py @@ -112,10 +112,10 @@ def train( nlp.to_disk(final_model_path) else: nlp.to_disk(final_model_path) - # This will only run if we don't hit an error - stdout.write( - msg.good("Saved pipeline to output directory", final_model_path) + "\n" - ) + # This will only run if we don't hit an error + stdout.write( + msg.good("Saved pipeline to output directory", final_model_path) + "\n" + ) def train_while_improving( diff --git a/website/docs/usage/_benchmarks-models.md b/website/docs/usage/_benchmarks-models.md index becd313f4..1e755e39d 100644 --- a/website/docs/usage/_benchmarks-models.md +++ b/website/docs/usage/_benchmarks-models.md @@ -1,19 +1,18 @@ import { Help } from 'components/typography'; import Link from 'components/link' - -
-| Pipeline | Parser | Tagger | NER | WPS
CPU words per second on CPU, higher is better | WPS
GPU words per second on GPU, higher is better | -| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: | -| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.7 | 1k | 8k | -| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.8 | 7k | | -| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | | 10k | | +| Pipeline | Parser | Tagger | NER | +| ---------------------------------------------------------- | -----: | -----: | ---: | +| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.4 | +| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.4 | +| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
**Full pipeline accuracy and speed** on the -[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus. +[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on +the development set).
@@ -21,14 +20,11 @@ import { Help } from 'components/typography'; import Link from 'components/link'
-| Named Entity Recognition System | OntoNotes | CoNLL '03 | -| ------------------------------------------------------------------------------ | --------: | --------: | -| spaCy RoBERTa (2020) | 89.7 | 91.6 | -| spaCy CNN (2020) | 84.5 | | -| spaCy CNN (2017) | | | -| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)1 | 88.8 | 92.1 | -| Flair2 | 89.7 | 93.1 | -| BERT Base3 | - | 92.4 | +| Named Entity Recognition System | OntoNotes | CoNLL '03 | +| -------------------------------- | --------: | --------: | +| spaCy RoBERTa (2020) | 89.7 | 91.6 | +| Stanza (StanfordNLP)1 | 88.8 | 92.1 | +| Flair2 | 89.7 | 93.1 |
@@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link' [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See [NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for -more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). -**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3. -** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805). +more results. Project template: +[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. ** +[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. ** +[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).
diff --git a/website/docs/usage/facts-figures.md b/website/docs/usage/facts-figures.md index 2707f68fa..269ac5e17 100644 --- a/website/docs/usage/facts-figures.md +++ b/website/docs/usage/facts-figures.md @@ -10,6 +10,18 @@ menu: ## Comparison {#comparison hidden="true"} +spaCy is a **free, open-source library** for advanced **Natural Language +Processing** (NLP) in Python. It's designed specifically for **production use** +and helps you build applications that process and "understand" large volumes of +text. It can be used to build information extraction or natural language +understanding systems. + +### Feature overview {#comparison-features} + +import Features from 'widgets/features.js' + + + ### When should I use spaCy? {#comparison-usage} - ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy @@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md' | Dependency Parsing System | UAS | LAS | | ------------------------------------------------------------------------------ | ---: | ---: | -| spaCy RoBERTa (2020)1 | 95.5 | 94.3 | -| spaCy CNN (2020)1 | | | +| spaCy RoBERTa (2020) | 95.5 | 94.3 | | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 | | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 | @@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md' **Dependency parsing accuracy** on the Penn Treebank. See [NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more -results. **1. ** Project template: +results. Project template: [`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank). diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index d1a8497d7..131bd8c94 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -489,11 +489,11 @@ This allows you to write callbacks that consider the entire set of matched phrases, so that you can resolve overlaps and other conflicts in whatever way you prefer. -| Argument | Description | -| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------- | -| `matcher` | The matcher instance. ~~Matcher~~ | -| `doc` | The document the matcher was used on. ~~Doc~~ | -| `i` | Index of the current match (`matches[i`]). ~~int~~ | +| Argument | Description | +| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- | +| `matcher` | The matcher instance. ~~Matcher~~ | +| `doc` | The document the matcher was used on. ~~Doc~~ | +| `i` | Index of the current match (`matches[i`]). ~~int~~ | | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ | ### Creating spans from matches {#matcher-spans} @@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences containing a match and render them with the [displaCy visualizer](/usage/visualizers). In the callback function, you'll have access to the `start` and `end` of each match, as well as the parent `Doc`. This -lets you determine the sentence containing the match, `doc[start:end].sent`, -and calculate the start and end of the matched span within the sentence. 
Using +lets you determine the sentence containing the match, `doc[start:end].sent`, and +calculate the start and end of the matched span within the sentence. Using displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a list of dictionaries containing the text and entities to render. diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 9191a7db2..d9d636bb1 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md' +#### New trained transformer-based pipelines {#features-transformers-pipelines} + +> #### Notes on model capabilities +> +> The models are each trained with a **single transformer** shared across the +> pipeline, which requires it to be trained on a single corpus. For +> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5 +> corpus, which has annotations across several tasks. For [French](/models/fr), +> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable +> corpus that had both syntactic and entity annotations, so the transformer +> models for those languages do not include NER. + +| Package | Language | Transformer | Tagger | Parser |  NER | +| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: | +| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.0 | 89.4 | +| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - | +| [`es_dep_news_trf`](/models/es#es_dep_news_trf) | Spanish | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 98.2 | 94.6 | - | +| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf) | French | [`camembert-base`](https://huggingface.co/camembert-base) | 95.7 | 94.9 | - | +| [`zh_core_web_trf`](/models/zh#zh_core_news_trf) | Chinese | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 92.5 | 77.2 | 75.6 | + - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers), @@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md' - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), [TransformerListener](/api/architectures#TransformerListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer) -- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf), - [`de_dep_news_trf`](/models/de#de_dep_news_trf), - [`es_dep_news_trf`](/models/es#es_dep_news_trf), - [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf), - [`zh_core_web_trf`](/models/zh#zh_core_web_trf) - **Implementation:** [`spacy-transformers`](https://github.com/explosion/spacy-transformers) diff --git a/website/src/widgets/features.js b/website/src/widgets/features.js new file mode 100644 index 000000000..73863d5cc --- /dev/null +++ b/website/src/widgets/features.js @@ -0,0 +1,72 @@ +import React from 'react' +import { graphql, StaticQuery } from 'gatsby' + +import { Ul, Li } from '../components/list' + +export default () => ( + { + const { counts } = site.siteMetadata + return ( +
+                <Ul>
+                    <Li>
+                        ✅ Support for {counts.langs}+ languages
+                    </Li>
+                    <Li>
+                        ✅ {counts.models} trained pipelines for{' '}
+                        {counts.modelLangs} languages
+                    </Li>
+                    <Li>
+                        ✅ Multi-task learning with pretrained transformers like
+                        BERT
+                    </Li>
+                    <Li>
+                        ✅ Pretrained word vectors
+                    </Li>
+                    <Li>✅ State-of-the-art speed</Li>
+                    <Li>
+                        ✅ Production-ready training system
+                    </Li>
+                    <Li>
+                        ✅ Linguistically-motivated tokenization
+                    </Li>
+                    <Li>
+                        ✅ Components for named entity recognition, part-of-speech
+                        tagging, dependency parsing, sentence segmentation,{' '}
+                        text classification, lemmatization, morphological analysis,
+                        entity linking and more
+                    </Li>
+                    <Li>
+                        ✅ Easily extensible with custom components and attributes
+                    </Li>
+                    <Li>
+                        ✅ Support for custom models in PyTorch,{' '}
+                        TensorFlow and other frameworks
+                    </Li>
+                    <Li>
+                        ✅ Built in visualizers for syntax and NER
+                    </Li>
+                    <Li>
+                        ✅ Easy model packaging, deployment and workflow management
+                    </Li>
+                    <Li>✅ Robust, rigorously evaluated accuracy</Li>
+                </Ul>
+ ) + }} + /> +) + +const query = graphql` + query FeaturesQuery { + site { + siteMetadata { + counts { + langs + modelLangs + models + } + } + } + } +` diff --git a/website/src/widgets/landing.js b/website/src/widgets/landing.js index 46be93ab5..2cee9460f 100644 --- a/website/src/widgets/landing.js +++ b/website/src/widgets/landing.js @@ -14,13 +14,13 @@ import { LandingBanner, } from '../components/landing' import { H2 } from '../components/typography' -import { Ul, Li } from '../components/list' import { InlineCode } from '../components/code' import Button from '../components/button' import Link from '../components/link' import QuickstartTraining from './quickstart-training' import Project from './project' +import Features from './features' import courseImage from '../../docs/images/course.jpg' import prodigyImage from '../../docs/images/prodigy_overview.jpg' import projectsImage from '../../docs/images/projects.png' @@ -56,7 +56,7 @@ for entity in doc.ents: } const Landing = ({ data }) => { - const { counts, nightly } = data + const { nightly } = data const codeExample = getCodeExample(nightly) return ( <> @@ -98,51 +98,7 @@ const Landing = ({ data }) => {

                    <H2>Features</H2>

-                    <Ul>
-                        <Li>
-                            ✅ Support for {counts.langs}+ languages
-                        </Li>
-                        <Li>
-                            ✅ {counts.models} trained pipelines for{' '}
-                            {counts.modelLangs} languages
-                        </Li>
-                        <Li>
-                            ✅ Multi-task learning with pretrained transformers{' '}
-                            like BERT
-                        </Li>
-                        <Li>
-                            ✅ Pretrained word vectors
-                        </Li>
-                        <Li>✅ State-of-the-art speed</Li>
-                        <Li>
-                            ✅ Production-ready training system
-                        </Li>
-                        <Li>
-                            ✅ Linguistically-motivated tokenization
-                        </Li>
-                        <Li>
-                            ✅ Components for named entity recognition,
-                            part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
-                            text classification, lemmatization, morphological
-                            analysis, entity linking and more
-                        </Li>
-                        <Li>
-                            ✅ Easily extensible with custom components and
-                            attributes
-                        </Li>
-                        <Li>
-                            ✅ Support for custom models in PyTorch,{' '}
-                            TensorFlow and other frameworks
-                        </Li>
-                        <Li>
-                            ✅ Built in visualizers for syntax and NER
-                        </Li>
-                        <Li>
-                            ✅ Easy model packaging, deployment and workflow
-                            management
-                        </Li>
-                        <Li>✅ Robust, rigorously evaluated accuracy</Li>
-                    </Ul>
+                    <Features />
@@ -333,11 +289,6 @@ const landingQuery = graphql` siteMetadata { nightly repo - counts { - langs - modelLangs - models - } } } }
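
Note on the Hindi changes above: a minimal usage sketch of the new `like_num`/`norm` behaviour in `spacy/lang/hi/lex_attrs.py`. This is illustrative only and not part of the diff; it assumes an install of spaCy that includes these changes and mirrors the cases exercised in `spacy/tests/lang/hi/test_lex_attrs.py`.

```python
from spacy.lang.hi.lex_attrs import like_num, norm

# Digits (including Devanagari digits) and cardinal words are number-like.
assert like_num("1987") and like_num("१९८७") and like_num("उन्नीस")
# Ordinals one to ten are listed explicitly in _ordinal_words_one_to_ten.
assert like_num("पहला")
# Larger ordinals are matched as a cardinal from _eleven_to_beyond plus the
# suffix "वाँ", e.g. "तिहत्तर" + "वाँ" ("seventy-third").
assert like_num("तिहत्तरवाँ")
# norm() strips the longest matching verbal/nominal suffix it finds.
assert norm("चलता") == "चल"
```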