mirror of https://github.com/explosion/spaCy.git
synced 2025-02-11 09:00:36 +03:00

Merge branch 'develop' into nightly.spacy.io
This commit is contained in: commit 7f440275ab

.github/contributors/Nuccy90.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@
# spaCy contributor agreement

This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.

If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.

## Contributor Agreement

1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.

2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:

    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;

    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;

    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;

    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and

    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.

3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:

    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and

    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.

4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.

5. You covenant, represent, warrant and agree that:

    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;

    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and

    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.

6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.

7. Please place an “x” on one of the applicable statement below. Please do NOT
mark both statements:

    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.

    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.

## Contributor Details

| Field                          | Entry      |
| ------------------------------ | ---------- |
| Name                           | Elena Fano |
| Company name (if applicable)   |            |
| Title or role (if applicable)  |            |
| Date                           | 2020-09-21 |
| GitHub username                | Nuccy90    |
| Website (optional)             |            |
.github/contributors/rahul1990gupta.md (vendored, new file, 106 lines)
@@ -0,0 +1,106 @@

# spaCy contributor agreement

(The agreement text of this file is identical, word for word, to the Nuccy90.md file above; only the Contributor Details table differs.)

## Contributor Details

| Field                          | Entry          |
| ------------------------------ | -------------- |
| Name                           | Rahul Gupta    |
| Company name (if applicable)   |                |
| Title or role (if applicable)  |                |
| Date                           | 28 July 2020   |
| GitHub username                | rahul1990gupta |
| Website (optional)             |                |
@@ -1,6 +1,6 @@
 # fmt: off
 __title__ = "spacy-nightly"
-__version__ = "3.0.0a41"
+__version__ = "3.0.0rc1"
 __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
 __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
 __projects__ = "https://github.com/explosion/projects"
@@ -10,23 +10,26 @@ _stem_suffixes = [
     ["ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां"],
     ["ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां"]
 ]
-# fmt: on
 
-# reference 1:https://en.wikipedia.org/wiki/Indian_numbering_system
+# reference 1: https://en.wikipedia.org/wiki/Indian_numbering_system
 # reference 2: https://blogs.transparent.com/hindi/hindi-numbers-1-100/
+# reference 3: https://www.mindurhindi.com/basic-words-and-phrases-in-hindi/
 
-_num_words = [
+_one_to_ten = [
     "शून्य",
     "एक",
     "दो",
     "तीन",
     "चार",
-    "पांच",
+    "पांच", "पाँच",
     "छह",
     "सात",
     "आठ",
     "नौ",
     "दस",
+]
+
+_eleven_to_beyond = [
     "ग्यारह",
     "बारह",
     "तेरह",
@@ -37,13 +40,85 @@ _num_words = [
     "अठारह",
     "उन्नीस",
     "बीस",
+    "इकीस", "इक्कीस",
+    "बाईस",
+    "तेइस",
+    "चौबीस",
+    "पच्चीस",
+    "छब्बीस",
+    "सताइस", "सत्ताइस",
+    "अट्ठाइस",
+    "उनतीस",
     "तीस",
+    "इकतीस", "इकत्तीस",
+    "बतीस", "बत्तीस",
+    "तैंतीस",
+    "चौंतीस",
+    "पैंतीस",
+    "छतीस", "छत्तीस",
+    "सैंतीस",
+    "अड़तीस",
+    "उनतालीस", "उनत्तीस",
     "चालीस",
+    "इकतालीस",
+    "बयालीस",
+    "तैतालीस",
+    "चवालीस",
+    "पैंतालीस",
+    "छयालिस",
+    "सैंतालीस",
+    "अड़तालीस",
+    "उनचास",
     "पचास",
+    "इक्यावन",
+    "बावन",
+    "तिरपन", "तिरेपन",
+    "चौवन", "चउवन",
+    "पचपन",
+    "छप्पन",
+    "सतावन", "सत्तावन",
+    "अठावन",
+    "उनसठ",
     "साठ",
+    "इकसठ",
+    "बासठ",
+    "तिरसठ", "तिरेसठ",
+    "चौंसठ",
+    "पैंसठ",
+    "छियासठ",
+    "सड़सठ",
+    "अड़सठ",
+    "उनहत्तर",
     "सत्तर",
+    "इकहत्तर"
+    "बहत्तर",
+    "तिहत्तर",
+    "चौहत्तर",
+    "पचहत्तर",
+    "छिहत्तर",
+    "सतहत्तर",
+    "अठहत्तर",
+    "उन्नासी", "उन्यासी"
     "अस्सी",
+    "इक्यासी",
+    "बयासी",
+    "तिरासी",
+    "चौरासी",
+    "पचासी",
+    "छियासी",
+    "सतासी",
+    "अट्ठासी",
+    "नवासी",
     "नब्बे",
+    "इक्यानवे",
+    "बानवे",
+    "तिरानवे",
+    "चौरानवे",
+    "पचानवे",
+    "छियानवे",
+    "सतानवे",
+    "अट्ठानवे",
+    "निन्यानवे",
     "सौ",
     "हज़ार",
     "लाख",
@@ -52,6 +127,23 @@ _num_words = [
     "खरब",
 ]
 
+
+_num_words = _one_to_ten + _eleven_to_beyond
+
+_ordinal_words_one_to_ten = [
+    "प्रथम", "पहला",
+    "द्वितीय", "दूसरा",
+    "तृतीय", "तीसरा",
+    "चौथा",
+    "पांचवाँ",
+    "छठा",
+    "सातवाँ",
+    "आठवाँ",
+    "नौवाँ",
+    "दसवाँ",
+]
+_ordinal_suffix = "वाँ"
+# fmt: on
 
 def norm(string):
     # normalise base exceptions, e.g. punctuation or currency symbols
@@ -64,7 +156,7 @@ def norm(string):
     for suffix_group in reversed(_stem_suffixes):
         length = len(suffix_group[0])
         if len(string) <= length:
-            break
+            continue
         for suffix in suffix_group:
             if string.endswith(suffix):
                 return string[:-length]
@@ -74,7 +166,7 @@ def norm(string):
 def like_num(text):
     if text.startswith(("+", "-", "±", "~")):
         text = text[1:]
-    text = text.replace(", ", "").replace(".", "")
+    text = text.replace(",", "").replace(".", "")
     if text.isdigit():
         return True
     if text.count("/") == 1:
@@ -83,6 +175,14 @@ def like_num(text):
             return True
     if text.lower() in _num_words:
         return True
+
+    # check ordinal numbers
+    # reference: http://www.englishkitab.com/Vocabulary/Numbers.html
+    if text in _ordinal_words_one_to_ten:
+        return True
+    if text.endswith(_ordinal_suffix):
+        if text[: -len(_ordinal_suffix)] in _eleven_to_beyond:
+            return True
     return False
 
 
@@ -19,4 +19,6 @@ sentences = [
     "தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
     "நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
     "லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
+    "என்ன வேலை செய்கிறீர்கள்?",
+    "எந்த கல்லூரியில் படிக்கிறாய்?",
 ]
@@ -73,20 +73,16 @@ def like_num(text):
         num, denom = text.split("/")
         if num.isdigit() and denom.isdigit():
             return True
-
     text_lower = text.lower()
-
     # Check cardinal number
     if text_lower in _num_words:
         return True
-
     # Check ordinal number
     if text_lower in _ordinal_words:
         return True
     if text_lower.endswith(_ordinal_endings):
         if text_lower[:-3].isdigit() or text_lower[:-4].isdigit():
             return True
-
     return False
@@ -1,6 +1,3 @@
-# coding: utf8
-from __future__ import unicode_literals
-
 from ...symbols import NOUN, PROPN, PRON
 from ...errors import Errors
 
@@ -125,6 +125,11 @@ def he_tokenizer():
     return get_lang_class("he")().tokenizer
 
 
+@pytest.fixture(scope="session")
+def hi_tokenizer():
+    return get_lang_class("hi")().tokenizer
+
+
 @pytest.fixture(scope="session")
 def hr_tokenizer():
     return get_lang_class("hr")().tokenizer
@@ -240,11 +245,6 @@ def tr_tokenizer():
     return get_lang_class("tr")().tokenizer
 
 
-@pytest.fixture(scope="session")
-def tr_vocab():
-    return get_lang_class("tr").Defaults.create_vocab()
-
-
 @pytest.fixture(scope="session")
 def tt_tokenizer():
     return get_lang_class("tt")().tokenizer
@@ -297,11 +297,7 @@ def zh_tokenizer_pkuseg():
                 "segmenter": "pkuseg",
             }
         },
-        "initialize": {
-            "tokenizer": {
-                "pkuseg_model": "web",
-            }
-        },
+        "initialize": {"tokenizer": {"pkuseg_model": "web"}},
     }
     nlp = get_lang_class("zh").from_config(config)
     nlp.initialize()
spacy/tests/lang/hi/__init__.py (new file, empty)

spacy/tests/lang/hi/test_lex_attrs.py (new file, 43 lines)
@@ -0,0 +1,43 @@
import pytest
from spacy.lang.hi.lex_attrs import norm, like_num


def test_hi_tokenizer_handles_long_text(hi_tokenizer):
    text = """
ये कहानी 1900 के दशक की है। कौशल्या (स्मिता जयकर) को पता चलता है कि उसका
छोटा बेटा, देवदास (शाहरुख खान) वापस घर आ रहा है। देवदास 10 साल पहले कानून की
पढ़ाई करने के लिए इंग्लैंड गया था। उसके लौटने की खुशी में ये बात कौशल्या अपनी पड़ोस
में रहने वाली सुमित्रा (किरण खेर) को भी बता देती है। इस खबर से वो भी खुश हो जाती है।
"""
    tokens = hi_tokenizer(text)
    assert len(tokens) == 86


@pytest.mark.parametrize(
    "word,word_norm",
    [
        ("चलता", "चल"),
        ("पढ़ाई", "पढ़"),
        ("देती", "दे"),
        ("जाती", "ज"),
        ("मुस्कुराकर", "मुस्कुर"),
    ],
)
def test_hi_norm(word, word_norm):
    assert norm(word) == word_norm


@pytest.mark.parametrize(
    "word",
    ["१९८७", "1987", "१२,२६७", "उन्नीस", "पाँच", "नवासी", "५/१०"],
)
def test_hi_like_num(word):
    assert like_num(word)


@pytest.mark.parametrize(
    "word",
    ["पहला", "तृतीय", "निन्यानवेवाँ", "उन्नीस", "तिहत्तरवाँ", "छत्तीसवाँ"],
)
def test_hi_like_num_ordinal_words(word):
    assert like_num(word)
@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_IOB
+
 from spacy import util
 from spacy.lang.en import English
 from spacy.language import Language

@@ -332,6 +335,19 @@ def test_overfitting_IO():
     assert ents2[0].text == "London"
     assert ents2[0].label_ == "LOC"
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_IOB]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_IOB]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
 
 
 def test_ner_warns_no_lookups(caplog):
     nlp = English()
@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import DEP
+
 from spacy.lang.en import English
 from spacy.training import Example
 from spacy.tokens import Doc

@@ -210,3 +213,16 @@ def test_overfitting_IO():
     assert doc2[0].dep_ == "nsubj"
     assert doc2[2].dep_ == "dobj"
     assert doc2[3].dep_ == "punct"
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([DEP]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([DEP]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
@@ -1,5 +1,7 @@
 from typing import Callable, Iterable
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import ENT_KB_ID
 
 from spacy.kb import KnowledgeBase, get_candidates, Candidate
 from spacy.vocab import Vocab

@@ -496,6 +498,19 @@ def test_overfitting_IO():
         predictions.append(ent.kb_id_)
     assert predictions == GOLD_entities
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Russ Cochran captured his first major title with his son as caddie.",
+        "Russ Cochran his reprints include EC Comics.",
+        "Russ Cochran has been publishing comic art.",
+        "Russ Cochran was a member of University of Kentucky's golf team.",
+    ]
+    batch_deps_1 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([ENT_KB_ID]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([ENT_KB_ID]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
 
 
 def test_kb_serialization():
     # Test that the KB can be used in a pipeline with a different vocab
spacy/tests/pipeline/test_models.py (new file, 107 lines)
@@ -0,0 +1,107 @@
from typing import List

import numpy
import pytest
from numpy.testing import assert_almost_equal
from spacy.vocab import Vocab
from thinc.api import NumpyOps, Model, data_validation
from thinc.types import Array2d, Ragged

from spacy.lang.en import English
from spacy.ml import FeatureExtractor, StaticVectors
from spacy.ml._character_embed import CharacterEmbed
from spacy.tokens import Doc

OPS = NumpyOps()

texts = ["These are 4 words", "Here just three"]
l0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
l1 = [[9, 8], [7, 6], [5, 4]]
list_floats = [OPS.xp.asarray(l0, dtype="f"), OPS.xp.asarray(l1, dtype="f")]
list_ints = [OPS.xp.asarray(l0, dtype="i"), OPS.xp.asarray(l1, dtype="i")]
array = OPS.xp.asarray(l1, dtype="f")
ragged = Ragged(array, OPS.xp.asarray([2, 1], dtype="i"))


def get_docs():
    vocab = Vocab()
    for t in texts:
        for word in t.split():
            hash_id = vocab.strings.add(word)
            vector = numpy.random.uniform(-1, 1, (7,))
            vocab.set_vector(hash_id, vector)
    docs = [English(vocab)(t) for t in texts]
    return docs


# Test components with a model of type Model[List[Doc], List[Floats2d]]
@pytest.mark.parametrize("name", ["tagger", "tok2vec", "morphologizer", "senter"])
def test_components_batching_list(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_list(proc.model, get_docs(), list_floats)


# Test components with a model of type Model[List[Doc], Floats2d]
@pytest.mark.parametrize("name", ["textcat"])
def test_components_batching_array(name):
    nlp = English()
    proc = nlp.create_pipe(name)
    util_batch_unbatch_docs_array(proc.model, get_docs(), array)


LAYERS = [
    (CharacterEmbed(nM=5, nC=3), get_docs(), list_floats),
    (FeatureExtractor([100, 200]), get_docs(), list_ints),
    (StaticVectors(), get_docs(), ragged),
]


@pytest.mark.parametrize("model,in_data,out_data", LAYERS)
def test_layers_batching_all(model, in_data, out_data):
    # In = List[Doc]
    if isinstance(in_data, list) and isinstance(in_data[0], Doc):
        if isinstance(out_data, OPS.xp.ndarray) and out_data.ndim == 2:
            util_batch_unbatch_docs_array(model, in_data, out_data)
        elif (
            isinstance(out_data, list)
            and isinstance(out_data[0], OPS.xp.ndarray)
            and out_data[0].ndim == 2
        ):
            util_batch_unbatch_docs_list(model, in_data, out_data)
        elif isinstance(out_data, Ragged):
            util_batch_unbatch_docs_ragged(model, in_data, out_data)


def util_batch_unbatch_docs_list(
    model: Model[List[Doc], List[Array2d]], in_data: List[Doc], out_data: List[Array2d]
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        for i in range(len(Y_batched)):
            assert_almost_equal(Y_batched[i], Y_not_batched[i], decimal=4)


def util_batch_unbatch_docs_array(
    model: Model[List[Doc], Array2d], in_data: List[Doc], out_data: Array2d
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data).tolist()
        Y_not_batched = [model.predict([u])[0] for u in in_data]
        assert_almost_equal(Y_batched, Y_not_batched, decimal=4)


def util_batch_unbatch_docs_ragged(
    model: Model[List[Doc], Ragged], in_data: List[Doc], out_data: Ragged
):
    with data_validation(True):
        model.initialize(in_data, out_data)
        Y_batched = model.predict(in_data)
        Y_not_batched = []
        for u in in_data:
            Y_not_batched.extend(model.predict([u]).data.tolist())
        assert_almost_equal(Y_batched.data, Y_not_batched, decimal=4)
@@ -1,4 +1,5 @@
 import pytest
+from numpy.testing import assert_equal
 
 from spacy import util
 from spacy.training import Example

@@ -6,6 +7,7 @@ from spacy.lang.en import English
 from spacy.language import Language
 from spacy.tests.util import make_tempdir
 from spacy.morphology import Morphology
+from spacy.attrs import MORPH
 
 
 def test_label_types():

@@ -101,3 +103,16 @@ def test_overfitting_IO():
     doc2 = nlp2(test_text)
     assert [str(t.morph) for t in doc2] == gold_morphs
     assert [t.pos_ for t in doc2] == gold_pos_tags
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([MORPH]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([MORPH]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
@@ -1,4 +1,6 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import SENT_START
 
 from spacy import util
 from spacy.training import Example

@@ -80,3 +82,18 @@ def test_overfitting_IO():
         nlp2 = util.load_model_from_path(tmp_dir)
         doc2 = nlp2(test_text)
         assert [int(t.is_sent_start) for t in doc2] == gold_sent_starts
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "Then one more sentence about London.",
+        "Here is another one.",
+        "I like London.",
+    ]
+    batch_deps_1 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([SENT_START]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [
+        doc.to_array([SENT_START]) for doc in [nlp(text) for text in texts]
+    ]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
@@ -1,4 +1,7 @@
 import pytest
+from numpy.testing import assert_equal
+from spacy.attrs import TAG
+
 from spacy import util
 from spacy.training import Example
 from spacy.lang.en import English

@@ -117,6 +120,19 @@ def test_overfitting_IO():
     assert doc2[2].tag_ is "J"
     assert doc2[3].tag_ is "N"
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = [
+        "Just a sentence.",
+        "I like green eggs.",
+        "Here is another one.",
+        "I eat ham.",
+    ]
+    batch_deps_1 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.to_array([TAG]) for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.to_array([TAG]) for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
 
 
 def test_tagger_requires_labels():
     nlp = English()
@@ -1,6 +1,7 @@
 import pytest
 import random
 import numpy.random
+from numpy.testing import assert_equal
 from thinc.api import fix_random_seed
 from spacy import util
 from spacy.lang.en import English

@@ -174,6 +175,14 @@ def test_overfitting_IO():
     assert scores["cats_score"] == 1.0
     assert "cats_score_desc" in scores
+
+    # Make sure that running pipe twice, or comparing to call, always amounts to the same predictions
+    texts = ["Just a sentence.", "I like green eggs.", "I am happy.", "I eat ham."]
+    batch_deps_1 = [doc.cats for doc in nlp.pipe(texts)]
+    batch_deps_2 = [doc.cats for doc in nlp.pipe(texts)]
+    no_batch_deps = [doc.cats for doc in [nlp(text) for text in texts]]
+    assert_equal(batch_deps_1, batch_deps_2)
+    assert_equal(batch_deps_1, no_batch_deps)
 
 
 # fmt: off
 @pytest.mark.parametrize(
spacy/tests/regression/test_issue5501-6000.py (new file, 76 lines)
@@ -0,0 +1,76 @@
from thinc.api import fix_random_seed
from spacy.lang.en import English
from spacy.tokens import Span
from spacy import displacy
from spacy.pipeline import merge_entities


def test_issue5551():
    """Test that after fixing the random seed, the results of the pipeline are truly identical"""
    component = "textcat"
    pipe_cfg = {
        "model": {
            "@architectures": "spacy.TextCatBOW.v1",
            "exclusive_classes": True,
            "ngram_size": 2,
            "no_output_layer": False,
        }
    }
    results = []
    for i in range(3):
        fix_random_seed(0)
        nlp = English()
        example = (
            "Once hot, form ping-pong-ball-sized balls of the mixture, each weighing roughly 25 g.",
            {"cats": {"Labe1": 1.0, "Label2": 0.0, "Label3": 0.0}},
        )
        pipe = nlp.add_pipe(component, config=pipe_cfg, last=True)
        for label in set(example[1]["cats"]):
            pipe.add_label(label)
        nlp.initialize()
        # Store the result of each iteration
        result = pipe.model.predict([nlp.make_doc(example[0])])
        results.append(list(result[0]))
    # All results should be the same because of the fixed seed
    assert len(results) == 3
    assert results[0] == results[1]
    assert results[0] == results[2]


def test_issue5838():
    # Displacy's EntityRenderer break line
    # not working after last entity
    sample_text = "First line\nSecond line, with ent\nThird line\nFourth line\n"
    nlp = English()
    doc = nlp(sample_text)
    doc.ents = [Span(doc, 7, 8, label="test")]
    html = displacy.render(doc, style="ent")
    found = html.count("</br>")
    assert found == 4


def test_issue5918():
    # Test edge case when merging entities.
    nlp = English()
    ruler = nlp.add_pipe("entity_ruler")
    patterns = [
        {"label": "ORG", "pattern": "Digicon Inc"},
        {"label": "ORG", "pattern": "Rotan Mosle Inc's"},
        {"label": "ORG", "pattern": "Rotan Mosle Technology Partners Ltd"},
    ]
    ruler.add_patterns(patterns)

    text = """
    Digicon Inc said it has completed the previously-announced disposition
    of its computer systems division to an investment group led by
    Rotan Mosle Inc's Rotan Mosle Technology Partners Ltd affiliate.
    """
    doc = nlp(text)
    assert len(doc.ents) == 3
    # make it so that the third span's head is within the entity (ent_iob=I)
    # bug #5918 would wrongly transfer that I to the full entity, resulting in 2 instead of 3 final ents.
    # TODO: test for logging here
    # with pytest.warns(UserWarning):
    #     doc[29].head = doc[33]
    doc = merge_entities(doc)
    assert len(doc.ents) == 3
@@ -1,37 +0,0 @@
(file removed: the standalone test_issue5551 regression module; its contents now live in test_issue5501-6000.py above, the only difference being that it imported fix_random_seed from spacy.util rather than thinc.api)

@@ -1,23 +0,0 @@
(file removed: the standalone test_issue5838 regression module, also moved into test_issue5501-6000.py above; its module-level multi-line SAMPLE_TEXT constant became the inline sample_text string)

@@ -1,29 +0,0 @@
(file removed: the standalone test_issue5918 regression module, moved verbatim into test_issue5501-6000.py above)
@@ -20,7 +20,8 @@ def docs_to_json(docs, doc_id=0, ner_missing_tag="O"):
         docs = [docs]
     json_doc = {"id": doc_id, "paragraphs": []}
     for i, doc in enumerate(docs):
-        json_para = {'raw': doc.text, "sentences": [], "cats": [], "entities": [], "links": []}
+        raw = None if doc.has_unknown_spaces else doc.text
+        json_para = {'raw': raw, "sentences": [], "cats": [], "entities": [], "links": []}
         for cat, val in doc.cats.items():
             json_cat = {"label": cat, "value": val}
             json_para["cats"].append(json_cat)
@@ -112,10 +112,10 @@ def train(
                 nlp.to_disk(final_model_path)
         else:
             nlp.to_disk(final_model_path)
         # This will only run if we don't hit an error
         stdout.write(
             msg.good("Saved pipeline to output directory", final_model_path) + "\n"
         )


 def train_while_improving(
@@ -1,19 +1,18 @@
 import { Help } from 'components/typography'; import Link from 'components/link'
 
-<!-- TODO: update speed and v2 NER numbers -->
-
 <figure>
 
-| Pipeline                                                    | Parser | Tagger |  NER | WPS<br />CPU <Help>words per second on CPU, higher is better</Help> | WPS<br/>GPU <Help>words per second on GPU, higher is better</Help> |
-| ----------------------------------------------------------- | -----: | -----: | ---: | -------------------------------------------------------------------: | ------------------------------------------------------------------: |
-| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3)  |   95.5 |   98.3 | 89.7 | 1k                                                                    | 8k                                                                   |
-| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)    |   92.2 |   97.4 | 85.8 | 7k                                                                    |                                                                      |
-| `en_core_web_lg` (spaCy v2)                                 |   91.9 |   97.2 |      | 10k                                                                   |                                                                      |
+| Pipeline                                                    | Parser | Tagger |  NER |
+| ----------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf) (spaCy v3)  |   95.5 |   98.3 | 89.4 |
+| [`en_core_web_lg`](/models/en#en_core_web_lg) (spaCy v3)    |   92.2 |   97.4 | 85.4 |
+| `en_core_web_lg` (spaCy v2)                                 |   91.9 |   97.2 | 85.5 |
 
 <figcaption class="caption">
 
 **Full pipeline accuracy and speed** on the
-[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus.
+[OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) corpus (reported on
+the development set).
 
 </figcaption>

@@ -21,14 +20,11 @@ import { Help } from 'components/typography'; import Link from 'components/link'
 
 <figure>
 
 | Named Entity Recognition System | OntoNotes | CoNLL '03 |
-| ------------------------------------------------------------------------------ | --------: | --------: |
+| -------------------------------- | --------: | --------: |
 | spaCy RoBERTa (2020)             |      89.7 |      91.6 |
-| spaCy CNN (2020)                 |      84.5 |           |
-| spaCy CNN (2017)                 |           |           |
-| [Stanza](https://stanfordnlp.github.io/stanza/) (StanfordNLP)<sup>1</sup>       |      88.8 |      92.1 |
-| <Link to="https://github.com/flairNLP/flair" hideIcon>Flair</Link><sup>2</sup>  |      89.7 |      93.1 |
-| BERT Base<sup>3</sup>            |         - |      92.4 |
+| Stanza (StanfordNLP)<sup>1</sup> |      88.8 |      92.1 |
+| Flair<sup>2</sup>                |      89.7 |      93.1 |
 
 <figcaption class="caption">

@@ -36,9 +32,10 @@ import { Help } from 'components/typography'; import Link from 'components/link'
 [OntoNotes 5.0](https://catalog.ldc.upenn.edu/LDC2013T19) and
 [CoNLL-2003](https://www.aclweb.org/anthology/W03-0419.pdf) corpora. See
 [NLP-progress](http://nlpprogress.com/english/named_entity_recognition.html) for
-more results. **1. ** [Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf).
-**2. ** [Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/). **3.
-** [Devlin et al. (2018)](https://arxiv.org/abs/1810.04805).
+more results. Project template:
+[`benchmarks/ner_conll03`](%%GITHUB_PROJECTS/benchmarks/ner_conll03). **1. **
+[Qi et al. (2020)](https://arxiv.org/pdf/2003.07082.pdf). **2. **
+[Akbik et al. (2018)](https://www.aclweb.org/anthology/C18-1139/).
 
 </figcaption>
 
@@ -10,6 +10,18 @@ menu:
 
 ## Comparison {#comparison hidden="true"}
 
+spaCy is a **free, open-source library** for advanced **Natural Language
+Processing** (NLP) in Python. It's designed specifically for **production use**
+and helps you build applications that process and "understand" large volumes of
+text. It can be used to build information extraction or natural language
+understanding systems.
+
+### Feature overview {#comparison-features}
+
+import Features from 'widgets/features.js'
+
+<Features />
+
 ### When should I use spaCy? {#comparison-usage}
 
 - ✅ **I'm a beginner and just getting started with NLP.** – spaCy makes it easy

@@ -65,8 +77,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
 
 | Dependency Parsing System                                                      | UAS  | LAS  |
 | ------------------------------------------------------------------------------ | ---: | ---: |
-| spaCy RoBERTa (2020)<sup>1</sup>                                               | 95.5 | 94.3 |
-| spaCy CNN (2020)<sup>1</sup>                                                   |      |      |
+| spaCy RoBERTa (2020)                                                           | 95.5 | 94.3 |
 | [Mrini et al.](https://khalilmrini.github.io/Label_Attention_Layer.pdf) (2019) | 97.4 | 96.3 |
 | [Zhou and Zhao](https://www.aclweb.org/anthology/P19-1230/) (2019)             | 97.2 | 95.7 |
 

@@ -74,7 +85,7 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
 
 **Dependency parsing accuracy** on the Penn Treebank. See
 [NLP-progress](http://nlpprogress.com/english/dependency_parsing.html) for more
-results. **1. ** Project template:
+results. Project template:
 [`benchmarks/parsing_penn_treebank`](%%GITHUB_PROJECTS/benchmarks/parsing_penn_treebank).
 
 </figcaption>
@@ -489,11 +489,11 @@ This allows you to write callbacks that consider the entire set of matched
 phrases, so that you can resolve overlaps and other conflicts in whatever way
 you prefer.
 
 | Argument  | Description |
-| --------- | -------------------------------------------------------------------------------------------------------------------------------------------------- |
+| --------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `matcher` | The matcher instance. ~~Matcher~~ |
 | `doc`     | The document the matcher was used on. ~~Doc~~ |
 | `i`       | Index of the current match (`matches[i`]). ~~int~~ |
 | `matches` | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. ~~List[Tuple[int, int int]]~~ |
 
 ### Creating spans from matches {#matcher-spans}
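For context, the argument table above documents the four values spaCy passes to a `Matcher` `on_match` callback. A minimal sketch of a callback that consumes them might look like the following — this snippet is not part of the commit, and the pattern, key and printed output are invented purely for illustration:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

def on_match(matcher, doc, i, matches):
    # matches[i] is the current (match_id, start, end) tuple; the span is doc[start:end]
    match_id, start, end = matches[i]
    print(doc.vocab.strings[match_id], "->", doc[start:end].text)

# Hypothetical pattern and key, used only to exercise the callback signature
matcher.add("HELLO_WORLD", [[{"LOWER": "hello"}, {"LOWER": "world"}]], on_match=on_match)
matches = matcher(nlp("Hello world! hello world again."))
```

Because the callback receives the full `matches` list rather than a single match, it can also inspect neighbouring matches to resolve overlaps, which is the point the surrounding prose makes.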
@@ -631,8 +631,8 @@ To get a quick overview of the results, you could collect all sentences
 containing a match and render them with the
 [displaCy visualizer](/usage/visualizers). In the callback function, you'll have
 access to the `start` and `end` of each match, as well as the parent `Doc`. This
-lets you determine the sentence containing the match, `doc[start:end].sent`,
-and calculate the start and end of the matched span within the sentence. Using
+lets you determine the sentence containing the match, `doc[start:end].sent`, and
+calculate the start and end of the matched span within the sentence. Using
 displaCy in ["manual" mode](/usage/visualizers#manual-usage) lets you pass in a
 list of dictionaries containing the text and entities to render.
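The paragraph above describes feeding matched sentences to displaCy in "manual" mode. A small sketch of that idea, assuming a single matched sentence — the text, offsets and label are made up for illustration and are not from the commit:

```python
from spacy import displacy

# Each dict is one "document": the sentence text plus character-offset entities to highlight
matched_sents = [
    {
        "text": "I like London.",
        "ents": [{"start": 7, "end": 13, "label": "MATCH"}],
    }
]
html = displacy.render(matched_sents, style="ent", manual=True)
```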
@@ -77,6 +77,26 @@ import Benchmarks from 'usage/\_benchmarks-models.md'

<Benchmarks />

+#### New trained transformer-based pipelines {#features-transformers-pipelines}
+
+> #### Notes on model capabilities
+>
+> The models are each trained with a **single transformer** shared across the
+> pipeline, which requires the whole pipeline to be trained on a single corpus.
+> For [English](/models/en) and [Chinese](/models/zh), we used the OntoNotes 5
+> corpus, which has annotations across several tasks. For [French](/models/fr),
+> [Spanish](/models/es) and [German](/models/de), we didn't have a suitable
+> corpus that had both syntactic and entity annotations, so the transformer
+> models for those languages do not include NER.
+
+| Package                                          | Language | Transformer                                                                                    | Tagger | Parser |  NER |
+| ------------------------------------------------ | -------- | ---------------------------------------------------------------------------------------------- | -----: | -----: | ---: |
+| [`en_core_web_trf`](/models/en#en_core_web_trf)  | English  | [`roberta-base`](https://huggingface.co/roberta-base)                                          |   97.8 |   95.0 | 89.4 |
+| [`de_dep_news_trf`](/models/de#de_dep_news_trf)  | German   | [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased)                      |   99.0 |   95.8 |    - |
+| [`es_dep_news_trf`](/models/es#es_dep_news_trf)  | Spanish  | [`bert-base-spanish-wwm-cased`](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased)  |   98.2 |   94.6 |    - |
+| [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf)  | French   | [`camembert-base`](https://huggingface.co/camembert-base)                                      |   95.7 |   94.9 |    - |
+| [`zh_core_web_trf`](/models/zh#zh_core_web_trf)  | Chinese  | [`bert-base-chinese`](https://huggingface.co/bert-base-chinese)                                |   92.5 |   77.2 | 75.6 |
+
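Assuming one of these packages has been installed, it loads like any other trained pipeline; a quick sketch with an illustrative example sentence:

```python
import spacy

# Assumes the package has been installed first, e.g.:
#   pip install spacy[transformers]
#   python -m spacy download en_core_web_trf
nlp = spacy.load("en_core_web_trf")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])
print([(token.text, token.tag_, token.dep_) for token in doc])
```
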
<Infobox title="Details & Documentation" emoji="📖" list>

- **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers),

@@ -88,11 +108,6 @@ import Benchmarks from 'usage/\_benchmarks-models.md'
- **Architectures:** [TransformerModel](/api/architectures#TransformerModel),
  [TransformerListener](/api/architectures#TransformerListener),
  [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Trained Pipelines:** [`en_core_web_trf`](/models/en#en_core_web_trf),
-  [`de_dep_news_trf`](/models/de#de_dep_news_trf),
-  [`es_dep_news_trf`](/models/es#es_dep_news_trf),
-  [`fr_dep_news_trf`](/models/fr#fr_dep_news_trf),
-  [`zh_core_web_trf`](/models/zh#zh_core_web_trf)
- **Implementation:**
  [`spacy-transformers`](https://github.com/explosion/spacy-transformers)

72
website/src/widgets/features.js
Normal file
@@ -0,0 +1,72 @@
import React from 'react'
import { graphql, StaticQuery } from 'gatsby'

import { Ul, Li } from '../components/list'

export default () => (
    <StaticQuery
        query={query}
        render={({ site }) => {
            const { counts } = site.siteMetadata
            return (
                <Ul>
                    <Li>
                        ✅ Support for <strong>{counts.langs}+ languages</strong>
                    </Li>
                    <Li>
                        ✅ <strong>{counts.models} trained pipelines</strong> for{' '}
                        {counts.modelLangs} languages
                    </Li>
                    <Li>
                        ✅ Multi-task learning with pretrained <strong>transformers</strong> like
                        BERT
                    </Li>
                    <Li>
                        ✅ Pretrained <strong>word vectors</strong>
                    </Li>
                    <Li>✅ State-of-the-art speed</Li>
                    <Li>
                        ✅ Production-ready <strong>training system</strong>
                    </Li>
                    <Li>
                        ✅ Linguistically-motivated <strong>tokenization</strong>
                    </Li>
                    <Li>
                        ✅ Components for <strong>named entity</strong> recognition, part-of-speech
                        tagging, dependency parsing, sentence segmentation,{' '}
                        <strong>text classification</strong>, lemmatization, morphological analysis,
                        entity linking and more
                    </Li>
                    <Li>
                        ✅ Easily extensible with <strong>custom components</strong> and attributes
                    </Li>
                    <Li>
                        ✅ Support for custom models in <strong>PyTorch</strong>,{' '}
                        <strong>TensorFlow</strong> and other frameworks
                    </Li>
                    <Li>
                        ✅ Built in <strong>visualizers</strong> for syntax and NER
                    </Li>
                    <Li>
                        ✅ Easy <strong>model packaging</strong>, deployment and workflow management
                    </Li>
                    <Li>✅ Robust, rigorously evaluated accuracy</Li>
                </Ul>
            )
        }}
    />
)

const query = graphql`
    query FeaturesQuery {
        site {
            siteMetadata {
                counts {
                    langs
                    modelLangs
                    models
                }
            }
        }
    }
`

@@ -14,13 +14,13 @@ import {
    LandingBanner,
} from '../components/landing'
import { H2 } from '../components/typography'
-import { Ul, Li } from '../components/list'
import { InlineCode } from '../components/code'
import Button from '../components/button'
import Link from '../components/link'

import QuickstartTraining from './quickstart-training'
import Project from './project'
+import Features from './features'
import courseImage from '../../docs/images/course.jpg'
import prodigyImage from '../../docs/images/prodigy_overview.jpg'
import projectsImage from '../../docs/images/projects.png'

@@ -56,7 +56,7 @@ for entity in doc.ents:
}

const Landing = ({ data }) => {
-    const { counts, nightly } = data
+    const { nightly } = data
    const codeExample = getCodeExample(nightly)
    return (
        <>

@@ -98,51 +98,7 @@ const Landing = ({ data }) => {

                <LandingCol>
                    <H2>Features</H2>
-                    <Ul>
-                        <Li>
-                            ✅ Support for <strong>{counts.langs}+ languages</strong>
-                        </Li>
-                        <Li>
-                            ✅ <strong>{counts.models} trained pipelines</strong> for{' '}
-                            {counts.modelLangs} languages
-                        </Li>
-                        <Li>
-                            ✅ Multi-task learning with pretrained <strong>transformers</strong>{' '}
-                            like BERT
-                        </Li>
-                        <Li>
-                            ✅ Pretrained <strong>word vectors</strong>
-                        </Li>
-                        <Li>✅ State-of-the-art speed</Li>
-                        <Li>
-                            ✅ Production-ready <strong>training system</strong>
-                        </Li>
-                        <Li>
-                            ✅ Linguistically-motivated <strong>tokenization</strong>
-                        </Li>
-                        <Li>
-                            ✅ Components for <strong>named entity</strong> recognition,
-                            part-of-speech tagging, dependency parsing, sentence segmentation,{' '}
-                            <strong>text classification</strong>, lemmatization, morphological
-                            analysis, entity linking and more
-                        </Li>
-                        <Li>
-                            ✅ Easily extensible with <strong>custom components</strong> and
-                            attributes
-                        </Li>
-                        <Li>
-                            ✅ Support for custom models in <strong>PyTorch</strong>,{' '}
-                            <strong>TensorFlow</strong> and other frameworks
-                        </Li>
-                        <Li>
-                            ✅ Built in <strong>visualizers</strong> for syntax and NER
-                        </Li>
-                        <Li>
-                            ✅ Easy <strong>model packaging</strong>, deployment and workflow
-                            management
-                        </Li>
-                        <Li>✅ Robust, rigorously evaluated accuracy</Li>
-                    </Ul>
+                    <Features />
                </LandingCol>
            </LandingGrid>

@@ -333,11 +289,6 @@ const landingQuery = graphql`
        siteMetadata {
            nightly
            repo
-            counts {
-                langs
-                modelLangs
-                models
-            }
        }
    }
}