Mirror of https://github.com/explosion/spaCy.git (synced 2025-02-11 09:00:36 +03:00)

Merge branch 'develop' into nightly.spacy.io

Commit: 7f440275ab

.github/contributors/Nuccy90.md (vendored, new file, 106 lines)

@@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Elena Fano |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2020-09-21 |
|
||||
| GitHub username | Nuccy90 |
|
||||
| Website (optional) | |
|
.github/contributors/rahul1990gupta.md (vendored, new file, 106 lines)
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made) will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statements below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Rahul Gupta |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 28 July 2020 |
|
||||
| GitHub username | rahul1990gupta |
|
||||
| Website (optional) | |
|
@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy-nightly"
__version__ = "3.0.0a41"
__version__ = "3.0.0rc1"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"
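The hunk above bumps the nightly package from the a41 alpha to the first release candidate. A quick way to confirm which prerelease is installed (a minimal sketch, assuming the `spacy-nightly` build is what's on the path):

```python
import spacy

# spacy.about holds the metadata defined in the hunk above
print(spacy.about.__title__)    # "spacy-nightly"
print(spacy.about.__version__)  # "3.0.0rc1" after this change
```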
|
@ -10,23 +10,26 @@ _stem_suffixes = [
|
|||
["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
|
||||
["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
|
||||
]
|
||||
# fmt: on
|
||||
|
||||
# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
|
||||
# reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
|
||||
# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/
|
||||
|
||||
_num_words = [
|
||||
_one_to_ten = [
|
||||
"शून्य",
|
||||
"एक",
|
||||
"दो",
|
||||
"तीन",
|
||||
"चार",
|
||||
"पांच",
|
||||
"पांच", "पाँच",
|
||||
"छह",
|
||||
"सात",
|
||||
"आठ",
|
||||
"नौ",
|
||||
"दस",
|
||||
]
|
||||
|
||||
_eleven_to_beyond = [
|
||||
"ग्यारह",
|
||||
"बारह",
|
||||
"तेरह",
|
||||
|
@ -37,13 +40,85 @@ _num_words = [
|
|||
"अठारह",
|
||||
"उन्नीस",
|
||||
"बीस",
|
||||
"इकीस", "इक्कीस",
|
||||
"बाईस",
|
||||
"तेइस",
|
||||
"चौबीस",
|
||||
"पच्चीस",
|
||||
"छब्बीस",
|
||||
"सताइस", "सत्ताइस",
|
||||
"अट्ठाइस",
|
||||
"उनतीस",
|
||||
"तीस",
|
||||
"इकतीस", "इकत्तीस",
|
||||
"बतीस", "बत्तीस",
|
||||
"तैंतीस",
|
||||
"चौंतीस",
|
||||
"पैंतीस",
|
||||
"छतीस", "छत्तीस",
|
||||
"सैंतीस",
|
||||
"अड़तीस",
|
||||
"उनतालीस", "उनत्तीस",
|
||||
"चालीस",
|
||||
"इकतालीस",
|
||||
"बयालीस",
|
||||
"तैतालीस",
|
||||
"चवालीस",
|
||||
"पैंतालीस",
|
||||
"छयालिस",
|
||||
"सैंतालीस",
|
||||
"अड़तालीस",
|
||||
"उनचास",
|
||||
"पचास",
|
||||
"इक्यावन",
|
||||
"बावन",
|
||||
"तिरपन", "तिरेपन",
|
||||
"चौवन", "चउवन",
|
||||
"पचपन",
|
||||
"छप्पन",
|
||||
"सतावन", "सत्तावन",
|
||||
"अठावन",
|
||||
"उनसठ",
|
||||
"साठ",
|
||||
"इकसठ",
|
||||
"बासठ",
|
||||
"तिरसठ", "तिरेसठ",
|
||||
"चौंसठ",
|
||||
"पैंसठ",
|
||||
"छियासठ",
|
||||
"सड़सठ",
|
||||
"अड़सठ",
|
||||
"उनहत्तर",
|
||||
"सत्तर",
|
||||
"इकहत्तर"
|
||||
"बहत्तर",
|
||||
"तिहत्तर",
|
||||
"चौहत्तर",
|
||||
"पचहत्तर",
|
||||
"छिहत्तर",
|
||||
"सतहत्तर",
|
||||
"अठहत्तर",
|
||||
"उन्नासी", "उन्यासी"
|
||||
"अस्सी",
|
||||
"इक्यासी",
|
||||
"बयासी",
|
||||
"तिरासी",
|
||||
"चौरासी",
|
||||
"पचासी",
|
||||
"छियासी",
|
||||
"सतासी",
|
||||
"अट्ठासी",
|
||||
"नवासी",
|
||||
"नब्बे",
|
||||
"इक्यानवे",
|
||||
"बानवे",
|
||||
"तिरानवे",
|
||||
"चौरानवे",
|
||||
"पचानवे",
|
||||
"छियानवे",
|
||||
"सतानवे",
|
||||
"अट्ठानवे",
|
||||
"निन्यानवे",
|
||||
"सौ",
|
||||
"हज़ार",
|
||||
"लाख",
|
||||
|
@ -52,6 +127,23 @@ _num_words = [
|
|||
"खरब",
|
||||
]
|
||||
|
||||
_num_words = _one_to_ten + _eleven_to_beyond
|
||||
|
||||
_ordinal_words_one_to_ten = [
|
||||
"प्रथम", "पहला",
|
||||
"द्वितीय", "दूसरा",
|
||||
"तृतीय", "तीसरा",
|
||||
"चौथा",
|
||||
"पांचवाँ",
|
||||
"छठा",
|
||||
"सातवाँ",
|
||||
"आठवाँ",
|
||||
"नौवाँ",
|
||||
"दसवाँ",
|
||||
]
|
||||
_ordinal_suffix = "वाँ"
|
||||
# fmt: on
|
||||
|
||||
|
||||
def norm(string):
|
||||
# normalise base exceptions, e.g. punctuation or currency symbols
|
||||
|
@ -64,7 +156,7 @@ def norm(string):
|
|||
for suffix_group in reversed(_stem_suffixes):
|
||||
length = len(suffix_group[0])
|
||||
if len(string) <= length:
|
||||
break
|
||||
continue
|
||||
for suffix in suffix_group:
|
||||
if string.endswith(suffix):
|
||||
return string[:-length]
|
||||
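The `norm` helper shown above strips the longest matching suffix from a Hindi surface form, checking suffix groups from longest to shortest. A short usage sketch, mirroring the new Hindi tests added later in this commit:

```python
from spacy.lang.hi.lex_attrs import norm

# suffix stripping as exercised by spacy/tests/lang/hi/test_lex_attrs.py below
assert norm("चलता") == "चल"
assert norm("पढ़ाई") == "पढ़"
assert norm("देती") == "दे"
```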
|
@ -83,6 +175,14 @@ def like_num(text):
|
|||
return True
|
||||
if text.lower() in _num_words:
|
||||
return True
|
||||
|
||||
# check ordinal numbers
|
||||
# reference: http://www.englishkitab.com/Vocabulary/Numbers.html
|
||||
if text in _ordinal_words_one_to_ten:
|
||||
return True
|
||||
if text.endswith(_ordinal_suffix):
|
||||
if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
|
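With the word lists split into `_one_to_ten` and `_eleven_to_beyond`, `like_num` now also recognizes Hindi ordinals: the explicit forms for one through ten, plus any word from eleven upwards carrying the `वाँ` suffix. A usage sketch based on the parametrized tests added in this commit:

```python
from spacy.lang.hi.lex_attrs import like_num

# digits (including Devanagari numerals), fractions and cardinal words
assert like_num("1987") and like_num("१९८७") and like_num("५/१०")
assert like_num("उन्नीस") and like_num("नवासी")
# ordinals: explicit words for one to ten, or a base word plus the "वाँ" suffix
assert like_num("पहला") and like_num("तिहत्तरवाँ")
```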
@@ -19,4 +19,6 @@ sentences = [
    "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
    "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
    "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
    "என்ன வேலை செய்கிறீர்கள்?",
    "எந்த கல்லூரியில் படிக்கிறாய்?",
]
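The two new example sentences extend the Tamil `sentences` list (presumably `spacy/lang/ta/examples.py`). They can be used to smoke-test a blank Tamil pipeline, for example:

```python
# Sketch, assuming the list above is exposed as spacy.lang.ta.examples.sentences
import spacy
from spacy.lang.ta.examples import sentences

nlp = spacy.blank("ta")
doc = nlp(sentences[0])
print([token.text for token in doc])
```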
@ -73,20 +73,16 @@ def like_num(text):
|
|||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
|
||||
text_lower = text.lower()
|
||||
|
||||
# Check cardinal number
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
if text_lower.endswith(_ordinal_endings):
|
||||
if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
|
||||
|
@@ -1,6 +1,3 @@
# coding: utf8
from __future__ import unicode_literals

from ...symbols import NOUN, PROPN, PRON
from ...errors import Errors
|
|
@ -125,6 +125,11 @@ def he_tokenizer():
|
|||
return get_lang_class("he")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def hi_tokenizer():
|
||||
return get_lang_class("hi")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def hr_tokenizer():
|
||||
return get_lang_class("hr")().tokenizer
|
||||
|
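The new session-scoped `hi_tokenizer` fixture added above follows the same pattern as the other language fixtures: instantiate the language class and take its tokenizer. Outside of pytest, the equivalent is roughly:

```python
# Rough equivalent of the fixture, for interactive use
from spacy.util import get_lang_class

hi_tokenizer = get_lang_class("hi")().tokenizer
doc = hi_tokenizer("ये कहानी 1900 के दशक की है।")
print(len(doc))
```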
@ -240,11 +245,6 @@ def tr_tokenizer():
|
|||
return get_lang_class("tr")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def tr_vocab():
|
||||
return get_lang_class("tr").Defaults.create_vocab()
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def tt_tokenizer():
|
||||
return get_lang_class("tt")().tokenizer
|
||||
|
@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
|
|||
"segmenter": "pkuseg",
|
||||
}
|
||||
},
|
||||
"initialize": {
|
||||
"tokenizer": {
|
||||
"pkuseg_model": "web",
|
||||
}
|
||||
},
|
||||
"initialize": {"tokenizer": {"pkuseg_model": "web"}},
|
||||
}
|
||||
nlp = get_lang_class("zh").from_config(config)
|
||||
nlp.initialize()
|
||||
|
|
spacy/tests/lang/hi/__init__.py (new file, 0 lines)

spacy/tests/lang/hi/test_lex_attrs.py (new file, 43 lines)
|
@ -0,0 +1,43 @@
|
|||
import pytest
|
||||
from spacy.lang.hi.lex_attrs import norm, like_num
|
||||
|
||||
|
||||
def test_hi_tokenizer_handles_long_text(hi_tokenizer):
|
||||
text = """
|
||||
ये कहानी 1900 के दशक की है। कौशल्या (स्मिता जयकर) को पता चलता है कि उसका
|
||||
छोटा बेटा, देवदास (शाहरुख खान) वापस घर आ रहा है। देवदास 10 साल पहले कानून की
|
||||
पढ़ाई करने के लिए इंग्लैंड गया था। उसके लौटने की खुशी में ये बात कौशल्या अपनी पड़ोस
|
||||
में रहने वाली सुमित्रा (किरण खेर) को भी बता देती है। इस खबर से वो भी खुश हो जाती है।
|
||||
"""
|
||||
tokens = hi_tokenizer(text)
|
||||
assert len(tokens) == 86
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word,word_norm",
|
||||
[
|
||||
("चलता", "चल"),
|
||||
("पढ़ाई", "पढ़"),
|
||||
("देती", "दे"),
|
||||
("जाती", "ज"),
|
||||
("मुस्कुराकर", "मुस्कुर"),
|
||||
],
|
||||
)
|
||||
def test_hi_norm(word, word_norm):
|
||||
assert norm(word) == word_norm
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
|
||||
)
|
||||
def test_hi_like_num(word):
|
||||
assert like_num(word)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"word",
|
||||
["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"],
|
||||
)
|
||||
def test_hi_like_num_ordinal_words(word):
|
||||
assert like_num(word)
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import ENT_IOB
|
||||
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
|
@ -332,6 +335,19 @@ def test_overfitting_IO():
|
|||
assert ents2[0].text == "London"
|
||||
assert ents2[0].label_ == "LOC"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_ner_warns_no_lookups(caplog):
|
||||
nlp = English()
|
||||
|
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import DEP
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.training import Example
|
||||
from spacy.tokens import Doc
|
||||
|
@ -210,3 +213,16 @@ def test_overfitting_IO():
|
|||
assert doc2[0].dep_ == "nsubj"
|
||||
assert doc2[2].dep_ == "dobj"
|
||||
assert doc2[3].dep_ == "punct"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,5 +1,7 @@
|
|||
from typing import Callable, Iterable
|
||||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import ENT_KB_ID
|
||||
|
||||
from spacy.kb import KnowledgeBase, get_candidates, Candidate
|
||||
from spacy.vocab import Vocab
|
||||
|
@ -496,6 +498,19 @@ def test_overfitting_IO():
|
|||
predictions.append(ent.kb_id_)
|
||||
assert predictions == GOLD_entities
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Russ Cochran captured his first major title with his son as caddie.",
|
||||
"Russ Cochran his reprints include EC Comics.",
|
||||
"Russ Cochran has been publishing comic art.",
|
||||
"Russ Cochran was a member of University of Kentucky's golf team.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_kb_serialization():
|
||||
# Test that the KB can be used in a pipeline with a different vocab
|
||||
|
|
spacy/tests/pipeline/test_models.py (new file, 107 lines)
|
@ -0,0 +1,107 @@
|
|||
from typing import List
|
||||
|
||||
import numpy
|
||||
import pytest
|
||||
from numpy.testing import assert_almost_equal
|
||||
from spacy.vocab import Vocab
|
||||
from thinc.api import NumpyOps, Model, data_validation
|
||||
from thinc.types import Array2d, Ragged
|
||||
|
||||
from spacy.lang.en import English
|
||||
from spacy.ml import FeatureExtractor, StaticVectors
|
||||
from spacy.ml._character_embed import CharacterEmbed
|
||||
from spacy.tokens import Doc
|
||||
|
||||
|
||||
OPS = NumpyOps()
|
||||
|
||||
texts = ["These are 4 words", "Here just three"]
|
||||
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
|
||||
l1 = [[9, 8], [7, 6], [5, 4]]
|
||||
list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")]
|
||||
list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")]
|
||||
array = OPS.xp.asarray(l1, dtype="f")
|
||||
ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i"))
|
||||
|
||||
|
||||
def get_docs():
|
||||
vocab = Vocab()
|
||||
for t in texts:
|
||||
for word in t.split():
|
||||
hash_id = vocab.strings.add(word)
|
||||
vector = numpy.random.uniform(-1, 1, (7,))
|
||||
vocab.set_vector(hash_id, vector)
|
||||
docs = [English(vocab)(t) for t in texts]
|
||||
return docs
|
||||
|
||||
|
||||
# Test components with a model of type Model[List[Doc], List[Floats2d]]
|
||||
@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"])
|
||||
def test_components_batching_list(name):
|
||||
nlp = English()
|
||||
proc = nlp.create_pipe(name)
|
||||
util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats)
|
||||
|
||||
|
||||
# Test components with a model of type Model[List[Doc], Floats2d]
|
||||
@pytest.mark.parametrize("name", ["textcat"])
|
||||
def test_components_batching_array(name):
|
||||
nlp = English()
|
||||
proc = nlp.create_pipe(name)
|
||||
util_batch_unbatch_docs_array(proc.model, get_docs(), array)
|
||||
|
||||
|
||||
LAYERS = [
|
||||
(CharacterEmbed(nM=5, nC=3), get_docs(), list_floats),
|
||||
(FeatureExtractor([100, 200]), get_docs(), list_ints),
|
||||
(StaticVectors(), get_docs(), ragged),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("model,in_data,out_data", LAYERS)
|
||||
def test_layers_batching_all(model, in_data, out_data):
|
||||
# In = List[Doc]
|
||||
if isinstance(in_data, list) and isinstance(in_data[0], Doc):
|
||||
if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2:
|
||||
util_batch_unbatch_docs_array(model, in_data, out_data)
|
||||
elif (
|
||||
isinstance(out_data, list)
|
||||
and isinstance(out_data[0], OPS.xp.ndarray)
|
||||
and out_data[0].ndim == 2
|
||||
):
|
||||
util_batch_unbatch_docs_list(model, in_data, out_data)
|
||||
elif isinstance(out_data, Ragged):
|
||||
util_batch_unbatch_docs_ragged(model, in_data, out_data)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_list(
|
||||
model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d]
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data)
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
for i in range(len(Y_batched)):
|
||||
assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_array(
|
||||
model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data).tolist()
|
||||
Y_not_batched = [model.predict([u])[0] for u in in_data]
|
||||
assert_almost_equal(Y_batched, Y_not_batched, decimal=4)
|
||||
|
||||
|
||||
def util_batch_unbatch_docs_ragged(
|
||||
model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged
|
||||
):
|
||||
with data_validation(True):
|
||||
model.initialize(in_data, out_data)
|
||||
Y_batched = model.predict(in_data)
|
||||
Y_not_batched = []
|
||||
for u in in_data:
|
||||
Y_not_batched.extend(model.predict([u]).data.tolist())
|
||||
assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
|
|
@ -1,4 +1,5 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
|
@ -6,6 +7,7 @@ from spacy.lang.en import English
|
|||
from spacy.language import Language
|
||||
from spacy.tests.util import make_tempdir
|
||||
from spacy.morphology import Morphology
|
||||
from spacy.attrs import MORPH
|
||||
|
||||
|
||||
def test_label_types():
|
||||
|
@ -101,3 +103,16 @@ def test_overfitting_IO():
|
|||
doc2 = nlp2(test_text)
|
||||
assert [str(t.morph) for t in doc2] == gold_morphs
|
||||
assert [t.pos_ for t in doc2] == gold_pos_tags
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,4 +1,6 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import SENT_START
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
|
@ -80,3 +82,18 @@ def test_overfitting_IO():
|
|||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"Then one more sentence about London.",
|
||||
"Here is another one.",
|
||||
"I like London.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [
|
||||
doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts]
|
||||
]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
|
|
@ -1,4 +1,7 @@
|
|||
import pytest
|
||||
from numpy.testing import assert_equal
|
||||
from spacy.attrs import TAG
|
||||
|
||||
from spacy import util
|
||||
from spacy.training import Example
|
||||
from spacy.lang.en import English
|
||||
|
@ -117,6 +120,19 @@ def test_overfitting_IO():
|
|||
assert doc2[2].tag_ == "J"
assert doc2[3].tag_ == "N"
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = [
|
||||
"Just a sentence.",
|
||||
"I like green eggs.",
|
||||
"Here is another one.",
|
||||
"I eat ham.",
|
||||
]
|
||||
batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
def test_tagger_requires_labels():
|
||||
nlp = English()
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import pytest
|
||||
import random
|
||||
import numpy.random
|
||||
from numpy.testing import assert_equal
|
||||
from thinc.api import fix_random_seed
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
|
@ -174,6 +175,14 @@ def test_overfitting_IO():
|
|||
assert scores["cats_score"] == 1.0
|
||||
assert "cats_score_desc" in scores
|
||||
|
||||
# Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
|
||||
texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
|
||||
batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
|
||||
no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
|
||||
assert_equal(batch_deps_1, batch_deps_2)
|
||||
assert_equal(batch_deps_1, no_batch_deps)
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
|
|
spacy/tests/regression/test_issue5501-6000.py (new file, 76 lines)
|
@ -0,0 +1,76 @@
|
|||
from thinc.api import fix_random_seed
|
||||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span
|
||||
from spacy import displacy
|
||||
from spacy.pipeline import merge_entities
|
||||
|
||||
|
||||
def test_issue5551():
|
||||
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
|
||||
component = "textcat"
|
||||
pipe_cfg = {
|
||||
"model": {
|
||||
"@architectures": "spacy.TextCatBOW.v1",
|
||||
"exclusive_classes": True,
|
||||
"ngram_size": 2,
|
||||
"no_output_layer": False,
|
||||
}
|
||||
}
|
||||
results = []
|
||||
for i in range(3):
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
example = (
|
||||
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
|
||||
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
|
||||
)
|
||||
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
|
||||
for label in set(example[1]["cats"]):
|
||||
pipe.add_label(label)
|
||||
nlp.initialize()
|
||||
# Store the result of each iteration
|
||||
result = pipe.model.predict([nlp.make_doc(example[0])])
|
||||
results.append(list(result[0]))
|
||||
# All results should be the same because of the fixed seed
|
||||
assert len(results) == 3
|
||||
assert results[0] == results[1]
|
||||
assert results[0] == results[2]
|
||||
|
||||
|
||||
def test_issue5838():
|
||||
# Displacy's EntityRenderer break line
|
||||
# not working after last entity
|
||||
sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n"
|
||||
nlp = English()
|
||||
doc = nlp(sample_text)
|
||||
doc.ents = [Span(doc, 7, 8, label="test")]
|
||||
html = displacy.render(doc, style="ent")
|
||||
found = html.count("</br>")
|
||||
assert found == 4
|
||||
|
||||
|
||||
def test_issue5918():
|
||||
# Test edge case when merging entities.
|
||||
nlp = English()
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
patterns = [
|
||||
{"label": "ORG", "pattern": "Digicon Inc"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
text = """
|
||||
Digicon Inc said it has completed the previously-announced disposition
|
||||
of its computer systems division to an investment group led by
|
||||
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
|
||||
"""
|
||||
doc = nlp(text)
|
||||
assert len(doc.ents) == 3
|
||||
# make it so that the third span's head is within the entity (ent_iob=I)
|
||||
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
|
||||
# TODO: test for logging here
|
||||
# with pytest.warns(UserWarning):
|
||||
# doc[29].head = doc[33]
|
||||
doc = merge_entities(doc)
|
||||
assert len(doc.ents) == 3
|
|
@ -1,37 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.util import fix_random_seed
|
||||
|
||||
|
||||
def test_issue5551():
|
||||
"""Test that after fixing the random seed, the results of the pipeline are truly identical"""
|
||||
component = "textcat"
|
||||
pipe_cfg = {
|
||||
"model": {
|
||||
"@architectures": "spacy.TextCatBOW.v1",
|
||||
"exclusive_classes": True,
|
||||
"ngram_size": 2,
|
||||
"no_output_layer": False,
|
||||
}
|
||||
}
|
||||
|
||||
results = []
|
||||
for i in range(3):
|
||||
fix_random_seed(0)
|
||||
nlp = English()
|
||||
example = (
|
||||
"Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
|
||||
{"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
|
||||
)
|
||||
pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
|
||||
for label in set(example[1]["cats"]):
|
||||
pipe.add_label(label)
|
||||
nlp.initialize()
|
||||
|
||||
# Store the result of each iteration
|
||||
result = pipe.model.predict([nlp.make_doc(example[0])])
|
||||
results.append(list(result[0]))
|
||||
|
||||
# All results should be the same because of the fixed seed
|
||||
assert len(results) == 3
|
||||
assert results[0] == results[1]
|
||||
assert results[0] == results[2]
|
|
@ -1,23 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.tokens import Span
|
||||
from spacy import displacy
|
||||
|
||||
|
||||
SAMPLE_TEXT = """First line
|
||||
Second line, with ent
|
||||
Third line
|
||||
Fourth line
|
||||
"""
|
||||
|
||||
|
||||
def test_issue5838():
|
||||
# Displacy's EntityRenderer break line
|
||||
# not working after last entity
|
||||
|
||||
nlp = English()
|
||||
doc = nlp(SAMPLE_TEXT)
|
||||
doc.ents = [Span(doc, 7, 8, label="test")]
|
||||
|
||||
html = displacy.render(doc, style="ent")
|
||||
found = html.count("</br>")
|
||||
assert found == 4
|
|
@ -1,29 +0,0 @@
|
|||
from spacy.lang.en import English
|
||||
from spacy.pipeline import merge_entities
|
||||
|
||||
|
||||
def test_issue5918():
|
||||
# Test edge case when merging entities.
|
||||
nlp = English()
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
patterns = [
|
||||
{"label": "ORG", "pattern": "Digicon Inc"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Inc's"},
|
||||
{"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
|
||||
]
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
text = """
|
||||
Digicon Inc said it has completed the previously-announced disposition
|
||||
of its computer systems division to an investment group led by
|
||||
Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
|
||||
"""
|
||||
doc = nlp(text)
|
||||
assert len(doc.ents) == 3
|
||||
# make it so that the third span's head is within the entity (ent_iob=I)
|
||||
# bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
|
||||
# TODO: test for logging here
|
||||
# with pytest.warns(UserWarning):
|
||||
# doc[29].head = doc[33]
|
||||
doc = merge_entities(doc)
|
||||
assert len(doc.ents) == 3
|
|
@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
|
|||
docs = [docs]
|
||||
json_doc = {"id": doc_id, "paragraphs": []}
|
||||
for i, doc in enumerate(docs):
|
||||
json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
|
||||
raw = None if doc.has_unknown_spaces else doc.text
|
||||
json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []}
|
||||
for cat, val in doc.cats.items():
|
||||
json_cat = {"label": cat, "value": val}
|
||||
json_para["cats"].append(json_cat)
|
||||
|
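The change above makes `docs_to_json` emit `"raw": None` for a `Doc` that was constructed without reliable whitespace information (`doc.has_unknown_spaces`), instead of always writing `doc.text`. A usage sketch, assuming `docs_to_json` is importable from `spacy.training`:

```python
import spacy
from spacy.training import docs_to_json  # import path assumed

nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")
json_doc = docs_to_json([doc], doc_id=0)
# "raw" is only None for Docs with unknown spacing; here it's the original text
print(json_doc["paragraphs"][0]["raw"])
```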
|
|
@ -1,19 +1,18 @@
|
|||
import { Help } from 'components/typography'; import Link from 'components/link'
|
||||
|
||||
<!-- TODO: update speed and v2 NER numbers -->
|
||||
|
||||
<figure>
|
||||
|
||||
| Pipeline | Parser | Tagger | NER | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br/>GPU <Help>words per second on GPU, higher is better</Help> |
|
||||
| ---------------------------------------------------------- | -----: | -----: | ---: | ------------------------------------------------------------------: | -----------------------------------------------------------------: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.7 | 1k | 8k |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.8 | 7k | |
|
||||
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | | 10k | |
|
||||
| Pipeline | Parser | Tagger | NER |
|
||||
| ---------------------------------------------------------- | -----: | -----: | ---: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3) | 95.5 | 98.3 | 89.4 |
|
||||
| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3) | 92.2 | 97.4 | 85.4 |
|
||||
| `en_core_web_lg` (spaCy v2) | 91.9 | 97.2 | 85.5 |
|
||||
|
||||
<figcaption class="caption">
|
||||
|
||||
**Full pipeline accuracy and speed** on the
|
||||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
|
||||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on
|
||||
the development set).
|
||||
|
||||
</figcaption>
|
||||
|
||||
|
@ -22,13 +21,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'
|
|||
<figure>
|
||||
|
||||
| Named Entity Recognition System | OntoNotes | CoNLL '03 |
|
||||
| ------------------------------------------------------------------------------ | --------: | --------: |
|
||||
| -------------------------------- | --------: | --------: |
|
||||
| spaCy RoBERTa (2020) | 89.7 | 91.6 |
|
||||
| spaCy CNN (2020) | 84.5 | |
|
||||
| spaCy CNN (2017) | | |
|
||||
| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
|
||||
| <Link to="https://github.com/flairNLP/flair" hideIcon>Flair</Link><sup>2</sup> | 89.7 | 93.1 |
|
||||
| BERT Base<sup>3</sup> | - | 92.4 |
|
||||
| Stanza (StanfordNLP)<sup>1</sup> | 88.8 | 92.1 |
|
||||
| Flair<sup>2</sup> | 89.7 | 93.1 |
|
||||
|
||||
<figcaption class="caption">
|
||||
|
||||
|
@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'
|
|||
[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and
|
||||
[CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See
|
||||
[NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for
|
||||
more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf).
|
||||
**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3.
|
||||
** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805).
|
||||
more results. Project template:
|
||||
[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. **
|
||||
[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. **
|
||||
[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).
|
||||
|
||||
</figcaption>
|
||||
|
||||
|
|
|
@ -10,6 +10,18 @@ menu:
|
|||
|
||||
## Comparison {#comparison hidden="true"}
|
||||
|
||||
spaCy is a **free, open-source library** for advanced **Natural Language
|
||||
Processing** (NLP) in Python. It's designed specifically for **production use**
|
||||
and helps you build applications that process and "understand" large volumes of
|
||||
text. It can be used to build information extraction or natural language
|
||||
understanding systems.
|
||||
|
||||
### Feature overview {#comparison-features}
|
||||
|
||||
import Features from 'widgets/features.js'
|
||||
|
||||
<Features />
|
||||
|
||||
### When should I use spaCy? {#comparison-usage}
|
||||
|
||||
- ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy
|
||||
|
@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
| Dependency Parsing System | UAS | LAS |
|
||||
| ------------------------------------------------------------------------------ | ---: | ---: |
|
||||
| spaCy RoBERTa (2020)<sup>1</sup> | 95.5 | 94.3 |
|
||||
| spaCy CNN (2020)<sup>1</sup> | | |
|
||||
| spaCy RoBERTa (2020) | 95.5 | 94.3 |
|
||||
| [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
|
||||
| [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019) | 97.2 | 95.7 |
|
||||
|
||||
|
@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
**Dependency parsing accuracy** on the Penn Treebank. See
|
||||
[NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
|
||||
results. **1. ** Project template:
|
||||
results. Project template:
|
||||
[`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).
|
||||
|
||||
</figcaption>
|
||||
|
|
|
@ -490,7 +490,7 @@ phrases, so that you can resolve overlaps and other conflicts in whatever way
|
|||
you prefer.
|
||||
|
||||
| Argument | Description |
|
||||
| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| `matcher` | The matcher instance. ~~Matcher~~ |
|
||||
| `doc` | The document the matcher was used on. ~~Doc~~ |
|
||||
| `i` | Index of the current match (`matches[i]`). ~~int~~ |
|
||||
|
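The table documents the arguments passed to an `on_match` callback. In code, the expected signature looks roughly like this (the pattern and names below are illustrative):

```python
from spacy.matcher import Matcher
from spacy.lang.en import English

nlp = English()
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    print("Matched span:", doc[start:end].text)

# hypothetical pattern; the callback receives the arguments listed in the table above
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=on_match)
matches = matcher(nlp("Hello world!"))
```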
@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
|
|||
containing a match and render them with the
|
||||
[displaCy visualizer](/usage/visualizers). In the callback function, you'll have
|
||||
access to the `start` and `end` of each match, as well as the parent `Doc`. This
|
||||
lets you determine the sentence containing the match, `doc[start:end].sent`,
|
||||
and calculate the start and end of the matched span within the sentence. Using
|
||||
lets you determine the sentence containing the match, `doc[start:end].sent`, and
|
||||
calculate the start and end of the matched span within the sentence. Using
|
||||
displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
|
||||
list of dictionaries containing the text and entities to render.
|
||||
|
||||
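A sketch of the callback pattern this passage describes, collecting each matched sentence in the "manual" format that displaCy expects (the `collect_sents` name and `"MATCH"` label are illustrative):

```python
matched_sents = []  # dicts for displacy.render(matched_sents, style="ent", manual=True)

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start:end]  # the matched span
    sent = span.sent       # sentence containing the match (requires sentence boundaries)
    # start/end of the match relative to the sentence, for manual rendering
    match_ents = [{
        "start": span.start_char - sent.start_char,
        "end": span.end_char - sent.start_char,
        "label": "MATCH",
    }]
    matched_sents.append({"text": sent.text, "ents": match_ents})
```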
|
|
|
@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
|
||||
<Benchmarks />
|
||||
|
||||
#### New trained transformer-based pipelines {#features-transformers-pipelines}
|
||||
|
||||
> #### Notes on model capabilities
|
||||
>
|
||||
> The models are each trained with a **single transformer** shared across the
|
||||
> pipeline, which requires it to be trained on a single corpus. For
|
||||
> [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
|
||||
> corpus, which has annotations across several tasks. For [French](/models/fr),
|
||||
> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
|
||||
> corpus that had both syntactic and entity annotations, so the transformer
|
||||
> models for those languages do not include NER.
|
||||
|
||||
| Package | Language | Transformer | Tagger | Parser | NER |
|
||||
| ------------------------------------------------ | -------- | --------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
|
||||
| [`en_core_web_trf`](/models/en#en_core_web_trf) | English | [`roberta-base`](https://huggingface.co/roberta-base) | 97.8 | 95.0 | 89.4 |
|
||||
| [`de_dep_news_trf`](/models/de#de_dep_news_trf) | German | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased) | 99.0 | 95.8 | - |
|
||||
| [`es_dep_news_trf`](/models/es#es_dep_news_trf) | Spanish | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) | 98.2 | 94.6 | - |
|
||||
| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf) | French | [`camembert-base`](https://huggingface.co/camembert-base) | 95.7 | 94.9 | - |
|
||||
| [`zh_core_web_trf`](/models/zh#zh_core_news_trf) | Chinese | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese) | 92.5 | 77.2 | 75.6 |
|
||||
|
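The table lists the new transformer-based pipelines and their per-component accuracy. Loading one works like any other trained pipeline, provided `spacy-transformers` and the chosen package are installed:

```python
import spacy

# any package from the table above, e.g. the English transformer pipeline
nlp = spacy.load("en_core_web_trf")
doc = nlp("spaCy v3 ships trained transformer-based pipelines.")
print([(t.text, t.tag_, t.dep_) for t in doc])
print(doc.ents)
```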
||||
<Infobox title="Details & Documentation" emoji="📖" list>
|
||||
|
||||
- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),
|
||||
|
@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
|
|||
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
|
||||
[TransformerListener](/api/architectures#TransformerListener),
|
||||
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
|
||||
- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf),
|
||||
[`de_dep_news_trf`](/models/de#de_dep_news_trf),
|
||||
[`es_dep_news_trf`](/models/es#es_dep_news_trf),
|
||||
[`fr_dep_news_trf`](/models/fr#fr_dep_news_trf),
|
||||
[`zh_core_web_trf`](/models/zh#zh_core_web_trf)
|
||||
- **Implementation:**
|
||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
|
||||
|
||||
|
|
website/src/widgets/features.js (new file, 72 lines)
|
@ -0,0 +1,72 @@
|
|||
import React from 'react'
|
||||
import { graphql, StaticQuery } from 'gatsby'
|
||||
|
||||
import { Ul, Li } from '../components/list'
|
||||
|
||||
export default () => (
|
||||
<StaticQuery
|
||||
query={query}
|
||||
render={({ site }) => {
|
||||
const { counts } = site.siteMetadata
|
||||
return (
|
||||
<Ul>
|
||||
<Li>
|
||||
✅ Support for <strong>{counts.langs}+ languages</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ <strong>{counts.models} trained pipelines</strong> for{' '}
|
||||
{counts.modelLangs} languages
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Multi-task learning with pretrained <strong>transformers</strong> like
|
||||
BERT
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Pretrained <strong>word vectors</strong>
|
||||
</Li>
|
||||
<Li>✅ State-of-the-art speed</Li>
|
||||
<Li>
|
||||
✅ Production-ready <strong>training system</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Linguistically-motivated <strong>tokenization</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Components for <strong>named entity</strong> recognition, part-of-speech
|
||||
tagging, dependency parsing, sentence segmentation,{' '}
|
||||
<strong>text classification</strong>, lemmatization, morphological analysis,
|
||||
entity linking and more
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easily extensible with <strong>custom components</strong> and attributes
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Support for custom models in <strong>PyTorch</strong>,{' '}
|
||||
<strong>TensorFlow</strong> and other frameworks
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Built in <strong>visualizers</strong> for syntax and NER
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easy <strong>model packaging</strong>, deployment and workflow management
|
||||
</Li>
|
||||
<Li>✅ Robust, rigorously evaluated accuracy</Li>
|
||||
</Ul>
|
||||
)
|
||||
}}
|
||||
/>
|
||||
)
|
||||
|
||||
const query = graphql`
|
||||
query FeaturesQuery {
|
||||
site {
|
||||
siteMetadata {
|
||||
counts {
|
||||
langs
|
||||
modelLangs
|
||||
models
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
`
|
|
@ -14,13 +14,13 @@ import {
|
|||
LandingBanner,
|
||||
} from '../components/landing'
|
||||
import { H2 } from '../components/typography'
|
||||
import { Ul, Li } from '../components/list'
|
||||
import { InlineCode } from '../components/code'
|
||||
import Button from '../components/button'
|
||||
import Link from '../components/link'
|
||||
|
||||
import QuickstartTraining from './quickstart-training'
|
||||
import Project from './project'
|
||||
import Features from './features'
|
||||
import courseImage from '../../docs/images/course.jpg'
|
||||
import prodigyImage from '../../docs/images/prodigy_overview.jpg'
|
||||
import projectsImage from '../../docs/images/projects.png'
|
||||
|
@ -56,7 +56,7 @@ for entity in doc.ents:
|
|||
}
|
||||
|
||||
const Landing = ({ data }) => {
|
||||
const { counts, nightly } = data
|
||||
const { nightly } = data
|
||||
const codeExample = getCodeExample(nightly)
|
||||
return (
|
||||
<>
|
||||
|
@ -98,51 +98,7 @@ const Landing = ({ data }) => {
|
|||
|
||||
<LandingCol>
|
||||
<H2>Features</H2>
|
||||
<Ul>
|
||||
<Li>
|
||||
✅ Support for <strong>{counts.langs}+ languages</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ <strong>{counts.models} trained pipelines</strong> for{' '}
|
||||
{counts.modelLangs} languages
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Multi-task learning with pretrained <strong>transformers</strong>{' '}
|
||||
like BERT
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Pretrained <strong>word vectors</strong>
|
||||
</Li>
|
||||
<Li>✅ State-of-the-art speed</Li>
|
||||
<Li>
|
||||
✅ Production-ready <strong>training system</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Linguistically-motivated <strong>tokenization</strong>
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Components for <strong>named entity</strong> recognition,
|
||||
part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
|
||||
<strong>text classification</strong>, lemmatization, morphological
|
||||
analysis, entity linking and more
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easily extensible with <strong>custom components</strong> and
|
||||
attributes
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Support for custom models in <strong>PyTorch</strong>,{' '}
|
||||
<strong>TensorFlow</strong> and other frameworks
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Built in <strong>visualizers</strong> for syntax and NER
|
||||
</Li>
|
||||
<Li>
|
||||
✅ Easy <strong>model packaging</strong>, deployment and workflow
|
||||
management
|
||||
</Li>
|
||||
<Li>✅ Robust, rigorously evaluated accuracy</Li>
|
||||
</Ul>
|
||||
<Features />
|
||||
</LandingCol>
|
||||
</LandingGrid>
|
||||
|
||||
|
@ -333,11 +289,6 @@ const landingQuery = graphql`
|
|||
siteMetadata {
|
||||
nightly
|
||||
repo
|
||||
counts {
|
||||
langs
|
||||
modelLangs
|
||||
models
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|