Merge branch 'develop' into nightly.spacy.io

commit 7f440275ab
Ines Montani, 2020-10-15 17:27:49 +02:00
31 changed files with 795 additions and 204 deletions

.github/contributors/Nuccy90.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Elena Fano |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2020-09-21 |
| GitHub username | Nuccy90 |
| Website (optional) | |

.github/contributors/rahul1990gupta.md (new file)

@@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Rahul Gupta |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 28 July 2020 |
| GitHub username | rahul1990gupta |
| Website (optional) | |

@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a41"
+__version__ = "3.0.0rc1"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"

@@ -10,23 +10,26 @@ _stem_suffixes = [
     ["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
     ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
 ]
-# fmt: on

-# reference 1:https://en.wikipedia.org/wiki/Indian_numbering_system
+# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
 # reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
+# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/

-_num_words = [
+_one_to_ten = [
     "शून्य",
     "एक",
     "दो",
     "तीन",
     "चार",
-    "पांच",
+    "पांच", "पाँच",
     "छह",
     "सात",
     "आठ",
     "नौ",
     "दस",
+]
+
+_eleven_to_beyond = [
     "ग्यारह",
     "बारह",
     "तेरह",
@@ -37,13 +40,85 @@ _num_words = [
     "अठारह",
     "उन्नीस",
     "बीस",
+    "इकीस", "इक्कीस",
+    "बाईस",
+    "तेइस",
+    "चौबीस",
+    "पच्चीस",
+    "छब्बीस",
+    "सताइस", "सत्ताइस",
+    "अट्ठाइस",
+    "उनतीस",
     "तीस",
+    "इकतीस", "इकत्तीस",
+    "बतीस", "बत्तीस",
+    "तैंतीस",
+    "चौंतीस",
+    "पैंतीस",
+    "छतीस", "छत्तीस",
+    "सैंतीस",
+    "अड़तीस",
+    "उनतालीस", "उनत्तीस",
     "चालीस",
+    "इकतालीस",
+    "बयालीस",
+    "तैतालीस",
+    "चवालीस",
+    "पैंतालीस",
+    "छयालिस",
+    "सैंतालीस",
+    "अड़तालीस",
+    "उनचास",
     "पचास",
+    "इक्यावन",
+    "बावन",
+    "तिरपन", "तिरेपन",
+    "चौवन", "चउवन",
+    "पचपन",
+    "छप्पन",
+    "सतावन", "सत्तावन",
+    "अठावन",
+    "उनसठ",
     "साठ",
+    "इकसठ",
+    "बासठ",
+    "तिरसठ", "तिरेसठ",
+    "चौंसठ",
+    "पैंसठ",
+    "छियासठ",
+    "सड़सठ",
+    "अड़सठ",
+    "उनहत्तर",
     "सत्तर",
+    "इकहत्तर",
+    "बहत्तर",
+    "तिहत्तर",
+    "चौहत्तर",
+    "पचहत्तर",
+    "छिहत्तर",
+    "सतहत्तर",
+    "अठहत्तर",
+    "उन्नासी", "उन्यासी",
     "अस्सी",
+    "इक्यासी",
+    "बयासी",
+    "तिरासी",
+    "चौरासी",
+    "पचासी",
+    "छियासी",
+    "सतासी",
+    "अट्ठासी",
+    "नवासी",
     "नब्बे",
+    "इक्यानवे",
+    "बानवे",
+    "तिरानवे",
+    "चौरानवे",
+    "पचानवे",
+    "छियानवे",
+    "सतानवे",
+    "अट्ठानवे",
+    "निन्यानवे",
     "सौ",
     "हज़ार",
     "लाख",
@@ -52,6 +127,23 @@ _num_words = [
     "खरब",
 ]

+_num_words = _one_to_ten + _eleven_to_beyond
+
+_ordinal_words_one_to_ten = [
+    "प्रथम", "पहला",
+    "द्वितीय", "दूसरा",
+    "तृतीय", "तीसरा",
+    "चौथा",
+    "पांचवाँ",
+    "छठा",
+    "सातवाँ",
+    "आठवाँ",
+    "नौवाँ",
+    "दसवाँ",
+]
+_ordinal_suffix = "वाँ"
+# fmt: on


 def norm(string):
     # normalise base exceptions, e.g. punctuation or currency symbols
@@ -64,7 +156,7 @@ def norm(string):
     for suffix_group in reversed(_stem_suffixes):
         length = len(suffix_group[0])
         if len(string) <= length:
-            break
+            continue
         for suffix in suffix_group:
             if string.endswith(suffix):
                 return string[:-length]
@@ -74,7 +166,7 @@ def norm(string):

 def like_num(text):
     if text.startswith(("+", "-", "±", "~")):
         text = text[1:]
-    text = text.replace(", ", "").replace(".", "")
+    text = text.replace(",", "").replace(".", "")
     if text.isdigit():
         return True
     if text.count("/") == 1:
@@ -83,6 +175,14 @@ def like_num(text):
             return True
     if text.lower() in _num_words:
         return True
+    # check ordinal numbers
+    # reference: http://www.englishkitab.com/Vocabulary/Numbers.html
+    if text in _ordinal_words_one_to_ten:
+        return True
+    if text.endswith(_ordinal_suffix):
+        if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
+            return True
     return False
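The ordinal handling added to `like_num` above can be exercised in isolation. A minimal sketch, using trimmed stand-ins for the full word lists in the diff (the function name `like_num_ordinal` is ours, for illustration):

```python
# Minimal sketch of the ordinal check added above; the lists here are
# trimmed stand-ins for the full Hindi word lists in the diff.
_eleven_to_beyond = ["ग्यारह", "तिहत्तर", "छत्तीस", "निन्यानवे"]
_ordinal_words_one_to_ten = ["प्रथम", "पहला", "दूसरा", "तीसरा"]
_ordinal_suffix = "वाँ"  # suffix that turns a cardinal into an ordinal


def like_num_ordinal(text: str) -> bool:
    # one to ten have dedicated ordinal forms
    if text in _ordinal_words_one_to_ten:
        return True
    # eleven and beyond: cardinal word plus the ordinal suffix
    if text.endswith(_ordinal_suffix):
        return text[: -len(_ordinal_suffix)] in _eleven_to_beyond
    return False


assert like_num_ordinal("पहला")        # "first"
assert like_num_ordinal("तिहत्तरवाँ")    # "seventy-third"
assert not like_num_ordinal("वाँ")      # bare suffix is not a number
```

The same slicing trick is what the diff relies on: stripping `_ordinal_suffix` must leave a word that is itself in the cardinal list.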

@@ -19,4 +19,6 @@ sentences = [
     "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
     "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
     "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
+    "என்ன வேலை செய்கிறீர்கள்?",
+    "எந்த கல்லூரியில் படிக்கிறாய்?",
 ]

@@ -73,20 +73,16 @@ def like_num(text):
         num, denom = text.split("/")
         if num.isdigit() and denom.isdigit():
             return True
-
     text_lower = text.lower()
-
     # Check cardinal number
     if text_lower in _num_words:
         return True
-
     # Check ordinal number
     if text_lower in _ordinal_words:
         return True
     if text_lower.endswith(_ordinal_endings):
         if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
             return True
-
     return False

@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors

@@ -125,6 +125,11 @@ def he_tokenizer():
     return get_lang_class("he")().tokenizer


+@pytest.fixture(scope="session")
+def hi_tokenizer():
+    return get_lang_class("hi")().tokenizer
+
+
 @pytest.fixture(scope="session")
 def hr_tokenizer():
     return get_lang_class("hr")().tokenizer
@@ -240,11 +245,6 @@ def tr_tokenizer():
     return get_lang_class("tr")().tokenizer


-@pytest.fixture(scope="session")
-def tr_vocab():
-    return get_lang_class("tr").Defaults.create_vocab()
-
-
 @pytest.fixture(scope="session")
 def tt_tokenizer():
     return get_lang_class("tt")().tokenizer
@@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
                 "segmenter": "pkuseg",
             }
         },
-        "initialize": {
-            "tokenizer": {
-                "pkuseg_model": "web",
-            }
-        },
+        "initialize": {"tokenizer": {"pkuseg_model": "web"}},
     }
     nlp = get_lang_class("zh").from_config(config)
     nlp.initialize()


@@ -0,0 +1,43 @@
import pytest

from spacy.lang.hi.lex_attrs import norm, like_num


def test_hi_tokenizer_handles_long_text(hi_tokenizer):
    text = """
कह 1900 दशक शल (ि जयकर) पत चलत ि उसक
, वद (हर ) पस घर रह वद 10 पहल
पढ करन ि गय उसक टन शल अपन पड
रहन ि (िरण ) बत इस खबर
"""
    tokens = hi_tokenizer(text)
    assert len(tokens) == 86


@pytest.mark.parametrize(
    "word,word_norm",
    [
        ("चलता", "चल"),
        ("पढ़ाई", "पढ़"),
        ("देती", "दे"),
        ("जाती", ""),
        ("मुस्कुराकर", "मुस्कुर"),
    ],
)
def test_hi_norm(word, word_norm):
    assert norm(word) == word_norm


@pytest.mark.parametrize(
    "word",
    ["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
)
def test_hi_like_num(word):
    assert like_num(word)


@pytest.mark.parametrize(
    "word",
    ["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"],
)
def test_hi_like_num_ordinal_words(word):
    assert like_num(word)

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_IOB
+
 from spacy import util
 from spacy.lang.en import English
 from spacy.language import Language
@@ -332,6 +335,19 @@ def test_overfitting_IO():
     assert ents2[0].text == "London"
     assert ents2[0].label_ == "LOC"

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_ner_warns_no_lookups(caplog):
     nlp = English()

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import DEP
+
 from spacy.lang.en import English
 from spacy.training import Example
 from spacy.tokens import Doc
@@ -210,3 +213,16 @@ def test_overfitting_IO():
     assert doc2[0].dep_ == "nsubj"
     assert doc2[2].dep_ == "dobj"
     assert doc2[3].dep_ == "punct"
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,5 +1,7 @@
 from typing import Callable, Iterable
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_KB_ID

 from spacy.kb import KnowledgeBase, get_candidates, Candidate
 from spacy.vocab import Vocab
@@ -496,6 +498,19 @@ def test_overfitting_IO():
             predictions.append(ent.kb_id_)
     assert predictions == GOLD_entities

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Russ Cochran captured his first major title with his son as caddie.",
+        "Russ Cochran his reprints include EC Comics.",
+        "Russ Cochran has been publishing comic art.",
+        "Russ Cochran was a member of University of Kentucky's golf team.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_kb_serialization():
     # Test that the KB can be used in a pipeline with a different vocab

@@ -0,0 +1,107 @@
from typing import List

import numpy
import pytest
from numpy.testing import assert_almost_equal
from spacy.vocab import Vocab
from thinc.api import NumpyOps, Model, data_validation
from thinc.types import Array2d, Ragged

from spacy.lang.en import English
from spacy.ml import FeatureExtractor, StaticVectors
from spacy.ml._character_embed import CharacterEmbed
from spacy.tokens import Doc

OPS = NumpyOps()

texts = ["These are 4 words", "Here just three"]
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
l1 = [[9, 8], [7, 6], [5, 4]]
list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")]
list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")]
array = OPS.xp.asarray(l1, dtype="f")
ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i"))


def get_docs():
    vocab = Vocab()
    for t in texts:
        for word in t.split():
            hash_id = vocab.strings.add(word)
            vector = numpy.random.uniform(-1, 1, (7,))
            vocab.set_vector(hash_id, vector)
    docs = [English(vocab)(t) for t in texts]
    return docs


# Test components with a model of type Model[List[Doc], List[Floats2d]]
@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"])
def test_components_batching_list(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats)


# Test components with a model of type Model[List[Doc], Floats2d]
@pytest.mark.parametrize("name", ["textcat"])
def test_components_batching_array(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_array(proc.model, get_docs(), array)


LAYERS = [
    (CharacterEmbed(nM=5, nC=3), get_docs(), list_floats),
    (FeatureExtractor([100, 200]), get_docs(), list_ints),
    (StaticVectors(), get_docs(), ragged),
]


@pytest.mark.parametrize("model,in_data,out_data", LAYERS)
def test_layers_batching_all(model, in_data, out_data):
    # In = List[Doc]
    if isinstance(in_data, list) and isinstance(in_data[0], Doc):
        if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2:
            util_batch_unbatch_docs_array(model, in_data, out_data)
        elif (
            isinstance(out_data, list)
            and isinstance(out_data[0], OPS.xp.ndarray)
            and out_data[0].ndim == 2
        ):
            util_batch_unbatch_docs_list(model, in_data, out_data)
        elif isinstance(out_data, Ragged):
            util_batch_unbatch_docs_ragged(model, in_data, out_data)


def util_batch_unbatch_docs_list(
    model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d]
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        for i in range(len(Y_batched)):
            assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)


def util_batch_unbatch_docs_array(
    model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data).tolist()
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        assert_almost_equal(Y_batched, Y_not_batched, decimal=4)


def util_batch_unbatch_docs_ragged(
    model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = []
        for u in in_data:
            Y_not_batched.extend(model.predict([u]).data.tolist())
        assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
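The batch-vs-unbatch utilities above all check one invariant: predicting a whole batch and predicting the items one at a time must give (near-)identical results. A toy stand-in for `model.predict` makes the pattern easy to see in isolation; the linear map here is hypothetical, not a spaCy model:

```python
import numpy
from numpy.testing import assert_almost_equal

# Toy stand-in for model.predict: a fixed linear map applied row-wise,
# so each input row is processed independently of its batch mates.
W = numpy.asarray([[1.0, -2.0], [0.5, 0.25], [3.0, 1.0]], dtype="f")


def predict(X):
    return X @ W


X = numpy.asarray([[1, 2, 3], [4, 5, 6], [7, 8, 9], [0, 1, 0]], dtype="f")
batched = predict(X)  # one call on the full batch
unbatched = numpy.vstack(  # one call per row
    [predict(X[i : i + 1]) for i in range(len(X))]
)
assert_almost_equal(batched, unbatched, decimal=4)
```

For a purely row-wise map the two are exactly equal; the `decimal=4` tolerance used in the tests above allows for the small floating-point drift that real batched kernels can introduce.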

@@ -1,4 +1,5 @@
 import pytest
+from numpy.testing import assert_equal

 from spacy import util
 from spacy.training import Example
@@ -6,6 +7,7 @@ from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
 from spacy.morphology import Morphology
+from spacy.attrs import MORPH


 def test_label_types():
@@ -101,3 +103,16 @@ def test_overfitting_IO():
     doc2 = nlp2(test_text)
     assert [str(t.morph) for t in doc2] == gold_morphs
     assert [t.pos_ for t in doc2] == gold_pos_tags
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,4 +1,6 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import SENT_START

 from spacy import util
 from spacy.training import Example
@@ -80,3 +82,18 @@ def test_overfitting_IO():
     nlp2 = util.load_model_from_path(tmp_dir)
     doc2 = nlp2(test_text)
     assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [
+        doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts]
+    ]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)

@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import TAG
+
 from spacy import util
 from spacy.training import Example
 from spacy.lang.en import English
@@ -117,6 +120,19 @@ def test_overfitting_IO():
     assert doc2[2].tag_ is "J"
     assert doc2[3].tag_ is "N"

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "I like green eggs.",
+        "Here is another one.",
+        "I eat ham.",
+    ]
+    batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 def test_tagger_requires_labels():
     nlp = English()

@@ -1,6 +1,7 @@
 import pytest
 import random
 import numpy.random
+from numpy.testing import assert_equal
 from thinc.api import fix_random_seed
 from spacy import util
 from spacy.lang.en import English
@@ -174,6 +175,14 @@ def test_overfitting_IO():
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores

+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
+    batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
+

 # fmt: off
 @pytest.mark.parametrize(

@@ -0,0 +1,76 @@
from thinc.api import fix_random_seed

from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
from spacy.pipeline import merge_entities


def test_issue5551():
    """Test that after fixing the random seed, the results of the pipeline are truly identical"""
    component = "textcat"
    pipe_cfg = {
        "model": {
            "@architectures": "spacy.TextCatBOW.v1",
            "exclusive_classes": True,
            "ngram_size": 2,
            "no_output_layer": False,
        }
    }

    results = []
    for i in range(3):
        fix_random_seed(0)
        nlp = English()
        example = (
            "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
            {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
        )
        pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
        for label in set(example[1]["cats"]):
            pipe.add_label(label)
        nlp.initialize()
        # Store the result of each iteration
        result = pipe.model.predict([nlp.make_doc(example[0])])
        results.append(list(result[0]))
    # All results should be the same because of the fixed seed
    assert len(results) == 3
    assert results[0] == results[1]
    assert results[0] == results[2]


def test_issue5838():
    # Displacy's EntityRenderer break line
    # not working after last entity
    sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n"
    nlp = English()
    doc = nlp(sample_text)
    doc.ents = [Span(doc, 7, 8, label="test")]
    html = displacy.render(doc, style="ent")
    found = html.count("</br>")
    assert found == 4


def test_issue5918():
    # Test edge case when merging entities.
    nlp = English()
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {"label": "ORG", "pattern": "Digicon Inc"},
        {"label": "ORG", "pattern": "Rotan Mosle Inc's"},
        {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
    ]
    ruler.add_patterns(patterns)
    text = """
    Digicon Inc said it has completed the previously-announced disposition
    of its computer systems division to an investment group led by
    Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
    """
    doc = nlp(text)
    assert len(doc.ents) == 3
    # make it so that the third span's head is within the entity (ent_iob=I)
    # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
    # TODO: test for logging here
    # with pytest.warns(UserWarning):
    #     doc[29].head = doc[33]
    doc = merge_entities(doc)
    assert len(doc.ents) == 3


@@ -1,37 +0,0 @@
from spacy.lang.en import English
from spacy.util import fix_random_seed
def test_issue5551():
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
component = "textcat"
pipe_cfg = {
"model": {
"@architectures": "spacy.TextCatBOW.v1",
"exclusive_classes": True,
"ngram_size": 2,
"no_output_layer": False,
}
}
results = []
for i in range(3):
fix_random_seed(0)
nlp = English()
example = (
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
)
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
for label in set(example[1]["cats"]):
pipe.add_label(label)
nlp.initialize()
# Store the result of each iteration
result = pipe.model.predict([nlp.make_doc(example[0])])
results.append(list(result[0]))
# All results should be the same because of the fixed seed
assert len(results) == 3
assert results[0] == results[1]
assert results[0] == results[2]


@@ -1,23 +0,0 @@
from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
SAMPLE_TEXT = """First line
Second line, with ent
Third line
Fourth line
"""
def test_issue5838():
# Displacy's EntityRenderer break line
# not working after last entity
nlp = English()
doc = nlp(SAMPLE_TEXT)
doc.ents = [Span(doc, 7, 8, label="test")]
html = displacy.render(doc, style="ent")
found = html.count("</br>")
assert found == 4


@@ -1,29 +0,0 @@
from spacy.lang.en import English
from spacy.pipeline import merge_entities
def test_issue5918():
# Test edge case when merging entities.
nlp = English()
ruler = nlp.add_pipe("entity_ruler")
patterns = [
{"label": "ORG", "pattern": "Digicon Inc"},
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
]
ruler.add_patterns(patterns)
text = """
Digicon Inc said it has completed the previously-announced disposition
of its computer systems division to an investment group led by
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
"""
doc = nlp(text)
assert len(doc.ents) == 3
# make it so that the third span's head is within the entity (ent_iob=I)
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
# TODO: test for logging here
# with pytest.warns(UserWarning):
# doc[29].head = doc[33]
doc = merge_entities(doc)
assert len(doc.ents) == 3


@@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
         docs = [docs]
     json_doc = {"id": doc_id, "paragraphs": []}
     for i, doc in enumerate(docs):
-        json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
+        raw = None if doc.has_unknown_spaces else doc.text
+        json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []}
        for cat, val in doc.cats.items():
            json_cat = {"label": cat, "value": val}
            json_para["cats"].append(json_cat)
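A minimal sketch of the behavior this hunk introduces, assuming spaCy v3's `spacy.training.docs_to_json`; the two example `Doc`s here are illustrative, not from the commit:

```python
import spacy
from spacy.tokens import Doc
from spacy.training import docs_to_json

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # docs_to_json iterates doc.sents

# Doc created from text: spacing is known, so "raw" keeps the original text
doc = nlp("Hello world")
print(docs_to_json([doc])["paragraphs"][0]["raw"])  # -> Hello world

# Doc built from words without spaces: has_unknown_spaces is True, so "raw"
# becomes None instead of a reconstructed (possibly wrong) text
doc2 = Doc(nlp.vocab, words=["Hello", "world"], sent_starts=[True, False])
print(docs_to_json([doc2])["paragraphs"][0]["raw"])  # -> None
```

The point of the change: a `Doc` constructed without whitespace information can't round-trip its text, so the JSON now records `raw: None` rather than a misleading string.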


@@ -112,10 +112,10 @@ def train(
            nlp.to_disk(final_model_path)
    else:
        nlp.to_disk(final_model_path)
    # This will only run if we don't hit an error
    stdout.write(
        msg.good("Saved pipeline to output directory", final_model_path) + "\n"
    )


def train_while_improving(


@@ -1,19 +1,18 @@
 import { Help } from 'components/typography'; import Link from 'components/link'

+<!-- TODO: update speed and v2 NER numbers -->
+
 <figure>

-| Pipeline                                                   | Parser | Tagger | NER  | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br/>GPU <Help>words per second on GPU, higher is better</Help> |
-| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: |
-| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5   | 98.3   | 89.7 | 1k                                                                   | 8k                                                                  |
-| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   | 92.2   | 97.4   | 85.8 | 7k                                                                   |                                                                     |
-| `en_core_web_lg` (spaCy v2)                                | 91.9   | 97.2   |      | 10k                                                                  |                                                                     |
+| Pipeline                                                   | Parser | Tagger | NER  |
+| ---------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5   | 98.3   | 89.4 |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)   | 92.2   | 97.4   | 85.4 |
+| `en_core_web_lg` (spaCy v2)                                | 91.9   | 97.2   | 85.5 |

 <figcaption class="caption">

 **Full pipeline accuracy and speed** on the
-[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
+[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on
+the development set).

 </figcaption>

@@ -21,14 +20,11 @@ import { Help } from 'components/typography'; import Link from 'components/link'

 <figure>

-| Named Entity Recognition System                                                | OntoNotes | CoNLL '03 |
-| ------------------------------------------------------------------------------ | --------: | --------: |
-| spaCy RoBERTa (2020)                                                           | 89.7      | 91.6      |
-| spaCy CNN (2020)                                                               | 84.5      |           |
-| spaCy CNN (2017)                                                               |           |           |
-| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup>      | 88.8      | 92.1      |
-| <Link to="https://github.com/flairNLP/flair" hideIcon>Flair</Link><sup>2</sup> | 89.7      | 93.1      |
-| BERT Base<sup>3</sup>                                                          | -         | 92.4      |
+| Named Entity Recognition System  | OntoNotes | CoNLL '03 |
+| -------------------------------- | --------: | --------: |
+| spaCy RoBERTa (2020)             | 89.7      | 91.6      |
+| Stanza (StanfordNLP)<sup>1</sup> | 88.8      | 92.1      |
+| Flair<sup>2</sup>                | 89.7      | 93.1      |

 <figcaption class="caption">

@@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'

 [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and
 [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See
 [NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for
-more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf).
-**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3.
-** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805).
+more results. Project template:
+[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. **
+[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. **
+[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).

 </figcaption>


@@ -10,6 +10,18 @@ menu:

 ## Comparison {#comparison hidden="true"}

+spaCy is a **free, open-source library** for advanced **Natural Language
+Processing** (NLP) in Python. It's designed specifically for **production use**
+and helps you build applications that process and "understand" large volumes of
+text. It can be used to build information extraction or natural language
+understanding systems.
+
+### Feature overview {#comparison-features}
+
+import Features from 'widgets/features.js'
+
+<Features />
+
 ### When should I use spaCy? {#comparison-usage}

 - ✅ **I'm a beginner and just getting started with NLP.** spaCy makes it easy

@@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 | Dependency Parsing System                                                      | UAS  | LAS  |
 | ------------------------------------------------------------------------------ | ---: | ---: |
-| spaCy RoBERTa (2020)<sup>1</sup>                                               | 95.5 | 94.3 |
-| spaCy CNN (2020)<sup>1</sup>                                                   |      |      |
+| spaCy RoBERTa (2020)                                                           | 95.5 | 94.3 |
 | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
 | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |

@@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 **Dependency parsing accuracy** on the Penn Treebank. See
 [NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
-results. **1. ** Project template:
+results. Project template:
 [`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).

 </figcaption>


@@ -489,11 +489,11 @@ This allows you to write callbacks that consider the entire set of matched
 phrases, so that you can resolve overlaps and other conflicts in whatever way
 you prefer.

 | Argument  | Description                                                                                                                                       |
 | --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `matcher` | The matcher instance. ~~Matcher~~                                                                                                                   |
 | `doc`     | The document the matcher was used on. ~~Doc~~                                                                                                       |
 | `i`       | Index of the current match (`matches[i]`). ~~int~~                                                                                                  |
 | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. ~~List[Tuple[int, int, int]]~~  |

 ### Creating spans from matches {#matcher-spans}
@@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start:end].sent`,
-and calculate the start and end of the matched span within the sentence. Using
+lets you determine the sentence containing the match, `doc[start:end].sent`, and
+calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.


@@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 <Benchmarks />

+#### New trained transformer-based pipelines {#features-transformers-pipelines}
+
+> #### Notes on model capabilities
+>
+> The models are each trained with a **single transformer** shared across the
+> pipeline, which requires it to be trained on a single corpus. For
+> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
+> corpus, which has annotations across several tasks. For [French](/models/fr),
+> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
+> corpus that had both syntactic and entity annotations, so the transformer
+> models for those languages do not include NER.
+
+| Package                                          | Language | Transformer                                                                                   | Tagger | Parser |  NER |
+| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf)  | English  | [`roberta-base`](https://huggingface.co/roberta-base)                                         |   97.8 |   95.0 | 89.4 |
+| [`de_dep_news_trf`](/models/de#de_dep_news_trf)  | German   | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased)                     |   99.0 |   95.8 |    - |
+| [`es_dep_news_trf`](/models/es#es_dep_news_trf)  | Spanish  | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) |   98.2 |   94.6 |    - |
+| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf)  | French   | [`camembert-base`](https://huggingface.co/camembert-base)                                     |   95.7 |   94.9 |    - |
+| [`zh_core_web_trf`](/models/zh#zh_core_web_trf)  | Chinese  | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese)                               |   92.5 |   77.2 | 75.6 |
+
 <Infobox title="Details & Documentation" emoji="📖" list>

 - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),

@@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [TransformerListener](/api/architectures#TransformerListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf),
-  [`de_dep_news_trf`](/models/de#de_dep_news_trf),
-  [`es_dep_news_trf`](/models/es#es_dep_news_trf),
-  [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf),
-  [`zh_core_web_trf`](/models/zh#zh_core_web_trf)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)


@ -0,0 +1,72 @@
import React from 'react'
import { graphql, StaticQuery } from 'gatsby'
import { Ul, Li } from '../components/list'
export default () => (
<StaticQuery
query={query}
render={({ site }) => {
const { counts } = site.siteMetadata
return (
<Ul>
<Li>
Support for <strong>{counts.langs}+ languages</strong>
</Li>
<Li>
<strong>{counts.models} trained pipelines</strong> for{' '}
{counts.modelLangs} languages
</Li>
<Li>
Multi-task learning with pretrained <strong>transformers</strong> like
BERT
</Li>
<Li>
Pretrained <strong>word vectors</strong>
</Li>
<Li> State-of-the-art speed</Li>
<Li>
Production-ready <strong>training system</strong>
</Li>
<Li>
Linguistically-motivated <strong>tokenization</strong>
</Li>
<Li>
Components for <strong>named entity</strong> recognition, part-of-speech
tagging, dependency parsing, sentence segmentation,{' '}
<strong>text classification</strong>, lemmatization, morphological analysis,
entity linking and more
</Li>
<Li>
Easily extensible with <strong>custom components</strong> and attributes
</Li>
<Li>
Support for custom models in <strong>PyTorch</strong>,{' '}
<strong>TensorFlow</strong> and other frameworks
</Li>
<Li>
Built in <strong>visualizers</strong> for syntax and NER
</Li>
<Li>
Easy <strong>model packaging</strong>, deployment and workflow management
</Li>
<Li> Robust, rigorously evaluated accuracy</Li>
</Ul>
)
}}
/>
)
const query = graphql`
query FeaturesQuery {
site {
siteMetadata {
counts {
langs
modelLangs
models
}
}
}
}
`


@@ -14,13 +14,13 @@ import {
     LandingBanner,
 } from '../components/landing'
 import { H2 } from '../components/typography'
-import { Ul, Li } from '../components/list'
 import { InlineCode } from '../components/code'
 import Button from '../components/button'
 import Link from '../components/link'

 import QuickstartTraining from './quickstart-training'
 import Project from './project'
+import Features from './features'
 import courseImage from '../../docs/images/course.jpg'
 import prodigyImage from '../../docs/images/prodigy_overview.jpg'
 import projectsImage from '../../docs/images/projects.png'

@@ -56,7 +56,7 @@ for entity in doc.ents:
 }

 const Landing = ({ data }) => {
-    const { counts, nightly } = data
+    const { nightly } = data
     const codeExample = getCodeExample(nightly)
     return (
         <>

@@ -98,51 +98,7 @@ const Landing = ({ data }) => {
                 <LandingCol>
                     <H2>Features</H2>
-                    <Ul>
-                        <Li>
-                            Support for <strong>{counts.langs}+ languages</strong>
-                        </Li>
-                        <Li>
-                            <strong>{counts.models} trained pipelines</strong> for{' '}
-                            {counts.modelLangs} languages
-                        </Li>
-                        <Li>
-                            Multi-task learning with pretrained <strong>transformers</strong>{' '}
-                            like BERT
-                        </Li>
-                        <Li>
-                            Pretrained <strong>word vectors</strong>
-                        </Li>
-                        <Li>State-of-the-art speed</Li>
-                        <Li>
-                            Production-ready <strong>training system</strong>
-                        </Li>
-                        <Li>
-                            Linguistically-motivated <strong>tokenization</strong>
-                        </Li>
-                        <Li>
-                            Components for <strong>named entity</strong> recognition,
-                            part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
-                            <strong>text classification</strong>, lemmatization, morphological
-                            analysis, entity linking and more
-                        </Li>
-                        <Li>
-                            Easily extensible with <strong>custom components</strong> and
-                            attributes
-                        </Li>
-                        <Li>
-                            Support for custom models in <strong>PyTorch</strong>,{' '}
-                            <strong>TensorFlow</strong> and other frameworks
-                        </Li>
-                        <Li>
-                            Built in <strong>visualizers</strong> for syntax and NER
-                        </Li>
-                        <Li>
-                            Easy <strong>model packaging</strong>, deployment and workflow
-                            management
-                        </Li>
-                        <Li>Robust, rigorously evaluated accuracy</Li>
-                    </Ul>
+                    <Features />
                 </LandingCol>
             </LandingGrid>

@@ -333,11 +289,6 @@ const landingQuery = graphql`
             siteMetadata {
                 nightly
                 repo
-                counts {
-                    langs
-                    modelLangs
-                    models
-                }
             }
         }
     }