Update v2.3.x branch (#5636)

* Fix typos and auto-format [ci skip]

* Add pkuseg warnings and auto-format [ci skip]

* Update Binder URL [ci skip]

* Update Binder version [ci skip]

* Update alignment example for new gold.align

* Update POS in tagging example

* Fix numpy.zeros() dtype for Doc.from_array

* Change example title to Dr.

Change the example title to "Dr." so that the current model excludes the title
in the initial example.

* Fix spacy convert argument

* Warning for sudachipy 0.4.5 (#5611)

* Create myavrum.md (#5612)

* Update lex_attrs.py (#5608)

* Create mahnerak.md (#5615)

* Some changes for Armenian (#5616)

* Fix numerals

* We need an Armenian question mark to make the sentence a question

* Add Nepali Language (#5622)

* added support for nepali lang

* added examples and test files

* added spacy contributor agreement

* Japanese model: add user_dict entries and small refactor (#5573)

* user_dict fields: add inflections, reading_forms, sub_tokens; delete unidic_tags.
Improve code readability around the token alignment procedure.

* add test cases, replace fugashi with sudachipy in conftest

* move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer

* tag is space -> both surface and tag are spaces

* consider len(text)==0

* Add warnings example in v2.3 migration guide (#5627)

* contribute (#5632)

* Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)

Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the
lexeme does **not** have a vector.
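
A minimal sketch of the corrected behaviour, based on the updated vectors test in this PR (the words and vector values are illustrative):

```python
import numpy
from spacy.vocab import Vocab

vocab = Vocab()
vocab.set_vector("cat", numpy.asarray([1.0, 2.0, 3.0], dtype="f"))
assert vocab["cat"].is_oov is False      # has a vector, so not out-of-vocabulary
assert vocab["hamster"].is_oov is True   # no vector, so out-of-vocabulary
```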

* Extend what's new in v2.3 with vocab / is_oov (#5635)

* Skip vocab in component config overrides (#5624)

* Fix backslashes in warnings config diff (#5640)

Fix backslashes in warnings config diff in v2.3 migration section.

* Disregard special tag _SP in check for new tag map (#5641)

* Skip special tag _SP in check for new tag map

In `Tagger.begin_training()`, check for new tags (aside from `_SP`) in the new
tag map initialized from the provided gold tuples when determining whether to
reinitialize the morphology with the new tag map.

* Simplify _SP check

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Rameshh <30867740+rameshhpathak@users.noreply.github.com>
Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Adriane Boyd, 2020-06-29 14:13:12 +02:00, committed by GitHub
parent e9d3e177f0, commit f42c9026f5
31 changed files with 1458 additions and 365 deletions

.github/contributors/mahnerak.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Karen Hambardzumyan |
| Company name (if applicable) | YerevaNN |
| Title or role (if applicable) | Researcher |
| Date | 2020-06-19 |
| GitHub username | mahnerak |
| Website (optional) | https://mahnerak.com/|

.github/contributors/myavrum.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Marat M. Yavrumyan |
| Company name (if applicable) | YSU, UD_Armenian Project |
| Title or role (if applicable) | Dr., Principal Investigator |
| Date | 2020-06-19 |
| GitHub username | myavrum |
| Website (optional) | http://armtreebank.yerevann.com/ |

.github/contributors/rameshhpathak.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Ramesh Pathak |
| Company name (if applicable) | Diyo AI |
| Title or role (if applicable) | AI Engineer |
| Date | June 21, 2020 |
| GitHub username | rameshhpathak |
| Website (optional) | rameshhpathak.github.io |

.github/contributors/richardliaw.md (new file, 106 lines)

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Richard Liaw |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 06/22/2020 |
| GitHub username | richardliaw |
| Website (optional) | |


@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models.
sentences = [
"Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
"Ո՞վ է Ֆրանսիայի նախագահը։",
"Որն է Միացյալ Նահանգների մայրաքաղաքը։",
"Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
"Ե՞րբ է ծնվել Բարաք Օբաման։",
]


@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM
_num_words = [
"զրօ",
"մէկ",
"զրո",
"մեկ",
"երկու",
"երեք",
"չորս",
@ -18,20 +18,21 @@ _num_words = [
"տասը",
"տասնմեկ",
"տասներկու",
"տասն­երեք",
"տասն­չորս",
"տասն­հինգ",
"տասն­վեց",
"տասն­յոթ",
"տասն­ութ",
"տասն­ինը",
"քսան" "երեսուն",
"տասներեք",
"տասնչորս",
"տասնհինգ",
"տասնվեց",
"տասնյոթ",
"տասնութ",
"տասնինը",
"քսան",
"երեսուն",
"քառասուն",
"հիսուն",
"վաթցսուն",
"վաթսուն",
"յոթանասուն",
"ութսուն",
"ինիսուն",
"իննսուն",
"հարյուր",
"հազար",
"միլիոն",


@ -20,12 +20,7 @@ from ... import util
# Hold the attributes we need with convenient names
DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
# Handling for multiple spaces in a row is somewhat awkward, this simplifies
# the flow by creating a dummy with the same interface.
DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
DummySpace = DummyNode(" ", " ", " ")
DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"])
def try_sudachi_import(split_mode="A"):
@ -53,7 +48,7 @@ def try_sudachi_import(split_mode="A"):
)
def resolve_pos(orth, pos, next_pos):
def resolve_pos(orth, tag, next_tag):
"""If necessary, add a field to the POS tag for UD mapping.
Under Universal Dependencies, sometimes the same Unidic POS tag can
be mapped differently depending on the literal token or its context
@ -64,124 +59,77 @@ def resolve_pos(orth, pos, next_pos):
# Some tokens have their UD tag decided based on the POS of the following
# token.
# orth based rules
if pos[0] in TAG_ORTH_MAP:
orth_map = TAG_ORTH_MAP[pos[0]]
# apply orth based mapping
if tag in TAG_ORTH_MAP:
orth_map = TAG_ORTH_MAP[tag]
if orth in orth_map:
return orth_map[orth], None
return orth_map[orth], None # current_pos, next_pos
# tag bi-gram mapping
if next_pos:
tag_bigram = pos[0], next_pos[0]
# apply tag bi-gram mapping
if next_tag:
tag_bigram = tag, next_tag
if tag_bigram in TAG_BIGRAM_MAP:
bipos = TAG_BIGRAM_MAP[tag_bigram]
if bipos[0] is None:
return TAG_MAP[pos[0]][POS], bipos[1]
current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
if current_pos is None: # apply tag uni-gram mapping for current_pos
return TAG_MAP[tag][POS], next_pos # only next_pos is identified by tag bi-gram mapping
else:
return bipos
return current_pos, next_pos
return TAG_MAP[pos[0]][POS], None
# apply tag uni-gram mapping
return TAG_MAP[tag][POS], None
# Use a mapping of paired punctuation to avoid splitting quoted sentences.
pairpunct = {'':'', '': '', '': ''}
def separate_sentences(doc):
"""Given a doc, mark tokens that start sentences based on Unidic tags.
"""
stack = [] # save paired punctuation
for i, token in enumerate(doc[:-2]):
# Set all tokens after the first to false by default. This is necessary
# for the doc code to be aware we've done sentencization, see
# `is_sentenced`.
token.sent_start = (i == 0)
if token.tag_:
if token.tag_ == "補助記号-括弧開":
ts = str(token)
if ts in pairpunct:
stack.append(pairpunct[ts])
elif stack and ts == stack[-1]:
stack.pop()
if token.tag_ == "補助記号-句点":
next_token = doc[i+1]
if next_token.tag_ != token.tag_ and not stack:
next_token.sent_start = True
def get_dtokens(tokenizer, text):
tokens = tokenizer.tokenize(text)
words = []
for ti, token in enumerate(tokens):
tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*'])
inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*'])
dtoken = DetailedToken(
token.surface(),
(tag, inf),
token.dictionary_form())
if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白':
# don't add multiple space tokens in a row
continue
words.append(dtoken)
# remove empty tokens. These can be produced with characters like … that
# Sudachi normalizes internally.
words = [ww for ww in words if len(ww.surface) > 0]
return words
def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
# Compare the content of tokens and text, first
words = [x.surface for x in dtokens]
if "".join("".join(words).split()) != "".join(text.split()):
raise ValueError(Errors.E194.format(text=text, words=words))
text_words = []
text_lemmas = []
text_tags = []
text_dtokens = []
text_spaces = []
text_pos = 0
# handle empty and whitespace-only texts
if len(words) == 0:
return text_words, text_lemmas, text_tags, text_spaces
return text_dtokens, text_spaces
elif len([word for word in words if not word.isspace()]) == 0:
assert text.isspace()
text_words = [text]
text_lemmas = [text]
text_tags = [gap_tag]
text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)]
text_spaces = [False]
return text_words, text_lemmas, text_tags, text_spaces
# normalize words to remove all whitespace tokens
norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()])
# align words with text
for word, dtoken in zip(norm_words, norm_dtokens):
return text_dtokens, text_spaces
# align words and dtokens by referring text, and insert gap tokens for the space char spans
for word, dtoken in zip(words, dtokens):
# skip all space tokens
if word.isspace():
continue
try:
word_start = text[text_pos:].index(word)
except ValueError:
raise ValueError(Errors.E194.format(text=text, words=words))
# space token
if word_start > 0:
w = text[text_pos:text_pos + word_start]
text_words.append(w)
text_lemmas.append(w)
text_tags.append(gap_tag)
text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
text_spaces.append(False)
text_pos += word_start
text_words.append(word)
text_lemmas.append(dtoken.lemma)
text_tags.append(dtoken.pos)
# content word
text_dtokens.append(dtoken)
text_spaces.append(False)
text_pos += len(word)
# poll a space char after the word
if text_pos < len(text) and text[text_pos] == " ":
text_spaces[-1] = True
text_pos += 1
# trailing space token
if text_pos < len(text):
w = text[text_pos:]
text_words.append(w)
text_lemmas.append(w)
text_tags.append(gap_tag)
text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
text_spaces.append(False)
return text_words, text_lemmas, text_tags, text_spaces
return text_dtokens, text_spaces
class JapaneseTokenizer(DummyTokenizer):
@ -191,29 +139,78 @@ class JapaneseTokenizer(DummyTokenizer):
self.tokenizer = try_sudachi_import(self.split_mode)
def __call__(self, text):
dtokens = get_dtokens(self.tokenizer, text)
# convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
sudachipy_tokens = self.tokenizer.tokenize(text)
dtokens = self._get_dtokens(sudachipy_tokens)
dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
# create Doc with tag bi-gram based part-of-speech identification rules
words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6
sub_tokens_list = list(sub_tokens_list)
doc = Doc(self.vocab, words=words, spaces=spaces)
next_pos = None
for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
token.tag_ = unidic_tag[0]
if next_pos:
next_pos = None # for bi-gram rules
for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
token.tag_ = dtoken.tag
if next_pos: # already identified in previous iteration
token.pos = next_pos
next_pos = None
else:
token.pos, next_pos = resolve_pos(
token.orth_,
unidic_tag,
unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None
dtoken.tag,
tags[idx + 1] if idx + 1 < len(tags) else None
)
# if there's no lemma info (it's an unk) just use the surface
token.lemma_ = lemma
doc.user_data["unidic_tags"] = unidic_tags
token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
doc.user_data["inflections"] = inflections
doc.user_data["reading_forms"] = readings
doc.user_data["sub_tokens"] = sub_tokens_list
return doc
def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
dtokens = [
DetailedToken(
token.surface(), # orth
'-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']), # tag
','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']), # inf
token.dictionary_form(), # lemma
token.reading_form(), # user_data['reading_forms']
sub_tokens_list[idx] if sub_tokens_list else None, # user_data['sub_tokens']
) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
# remove empty tokens which can be produced with characters like … that
]
# Sudachi normalizes internally and outputs each space char as a token.
# This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
return [
t for idx, t in enumerate(dtokens) if
idx == 0 or
not t.surface.isspace() or t.tag != '空白' or
not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白'
]
def _get_sub_tokens(self, sudachipy_tokens):
if self.split_mode is None or self.split_mode == "A": # do nothing for default split mode
return None
sub_tokens_list = [] # list of (list of list of DetailedToken | None)
for token in sudachipy_tokens:
sub_a = token.split(self.tokenizer.SplitMode.A)
if len(sub_a) == 1: # no sub tokens
sub_tokens_list.append(None)
elif self.split_mode == "B":
sub_tokens_list.append([self._get_dtokens(sub_a, False)])
else: # "C"
sub_b = token.split(self.tokenizer.SplitMode.B)
if len(sub_a) == len(sub_b):
dtokens = self._get_dtokens(sub_a, False)
sub_tokens_list.append([dtokens, dtokens])
else:
sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)])
return sub_tokens_list
def _get_config(self):
config = OrderedDict(
(


@ -1,144 +0,0 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
POS_PHRASE_MAP = {
"NOUN": "NP",
"NUM": "NP",
"PRON": "NP",
"PROPN": "NP",
"VERB": "VP",
"ADJ": "ADJP",
"ADV": "ADVP",
"CCONJ": "CCONJP",
}
# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
def yield_bunsetu(doc, debug=False):
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
prev = None
prev_tag = None
prev_dep = None
prev_head = None
for t in doc:
pos = t.pos_
pos_type = POS_PHRASE_MAP.get(pos, None)
tag = t.tag_
dep = t.dep_
head = t.head.i
if debug:
print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu)
# DET is always an individual bunsetu
if pos == "DET":
if bunsetu:
yield bunsetu, phrase_type, phrase
yield [t], None, None
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
# PRON or Open PUNCT always splits bunsetu
elif tag == "補助記号-括弧開":
if bunsetu:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = True
phrase_type = None
phrase = None
# bunsetu head not appeared
elif phrase_type is None:
if bunsetu and prev_tag == "補助記号-読点":
yield bunsetu, phrase_type, phrase
bunsetu = []
bunsetu_may_end = False
phrase_type = None
phrase = None
bunsetu.append(t)
if pos_type: # begin phrase
phrase = [t]
phrase_type = pos_type
if pos_type in {"ADVP", "CCONJP"}:
bunsetu_may_end = True
# entering new bunsetu
elif pos_type and (
pos_type != phrase_type or # different phrase type arises
bunsetu_may_end # same phrase type but bunsetu already ended
):
# exceptional case: NOUN to VERB
if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i:
bunsetu.append(t)
phrase_type = "VP"
phrase.append(t)
# exceptional case: VERB to NOUN
elif phrase_type == "VP" and pos_type == "NP" and (
prev_dep == 'compound' and prev_head == t.i or
dep == 'compound' and prev == head or
prev_dep == 'nmod' and prev_head == t.i
):
bunsetu.append(t)
phrase_type = "NP"
phrase.append(t)
else:
yield bunsetu, phrase_type, phrase
bunsetu = [t]
bunsetu_may_end = False
phrase_type = pos_type
phrase = [t]
# NOUN bunsetu
elif phrase_type == "NP":
bunsetu.append(t)
if not bunsetu_may_end and ((
(pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'}
) or (
pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
)):
phrase.append(t)
else:
bunsetu_may_end = True
# VERB bunsetu
elif phrase_type == "VP":
bunsetu.append(t)
if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound':
phrase.append(t)
else:
bunsetu_may_end = True
# ADJ bunsetu
elif phrase_type == "ADJP" and tag != '連体詞':
bunsetu.append(t)
if not bunsetu_may_end and ((
pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'}
) or (
pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
)):
phrase.append(t)
else:
bunsetu_may_end = True
# other bunsetu
else:
bunsetu.append(t)
prev = t.i
prev_tag = t.tag_
prev_dep = t.dep_
prev_head = head
if bunsetu:
yield bunsetu, phrase_type, phrase

spacy/lang/ne/__init__.py (new file, 23 lines)

@ -0,0 +1,23 @@
# coding: utf8
from __future__ import unicode_literals
from .stop_words import STOP_WORDS
from .lex_attrs import LEX_ATTRS
from ...language import Language
from ...attrs import LANG
class NepaliDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code
stop_words = STOP_WORDS
class Nepali(Language):
lang = "ne"
Defaults = NepaliDefaults
__all__ = ["Nepali"]
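
A minimal usage sketch for the new language class, using one of the sentences from the added tokenizer tests:

```python
from spacy.lang.ne import Nepali

nlp = Nepali()
doc = nlp("म ठूलो हुँदै थिएँ ।")
print([token.text for token in doc])  # 5 tokens, matching the new test expectation
```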

spacy/lang/ne/examples.py (new file, 22 lines)

@ -0,0 +1,22 @@
# coding: utf8
from __future__ import unicode_literals
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.ne.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
"स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
"स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
"लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
"तिमी कहाँ छौ?",
"फ्रान्स को राष्ट्रपति को हो?",
"संयुक्त राज्यको राजधानी के हो?",
"बराक ओबामा कहिले कहिले जन्मेका हुन्?",
]


@ -0,0 +1,98 @@
# coding: utf8
from __future__ import unicode_literals
from ..norm_exceptions import BASE_NORMS
from ...attrs import NORM, LIKE_NUM
# fmt: off
_stem_suffixes = [
["", "ि", "", "", "", "", "", "", "", ""],
["", "", "", ""],
["लाई", "ले", "बाट", "को", "मा", "हरू"],
["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
["इलो", "िलो", "नु", "ाउनु", "", "इन", "इन्", "इनन्"],
["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "", "एनन्"],
["छु", "छौँ", "छस्", "छौ", "", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
["", "", "", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
["याइ", "ाइ", "बार", "वार", "चाँहि"],
["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "", "", "नन्"]
]
# fmt: on
# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
# reference 2: https://www.imnepal.com/nepali-numbers/
_num_words = [
"शुन्य",
"एक",
"दुई",
"तीन",
"चार",
"पाँच",
"",
"सात",
"आठ",
"नौ",
"दश",
"एघार",
"बाह्र",
"तेह्र",
"चौध",
"पन्ध्र",
"सोह्र",
"सोह्र",
"सत्र",
"अठार",
"उन्नाइस",
"बीस",
"तीस",
"चालीस",
"पचास",
"साठी",
"सत्तरी",
"असी",
"नब्बे",
"सय",
"हजार",
"लाख",
"करोड",
"अर्ब",
"खर्ब",
]
def norm(string):
# normalise base exceptions, e.g. punctuation or currency symbols
if string in BASE_NORMS:
return BASE_NORMS[string]
# set stem word as norm, if available, adapted from:
# https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
# https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
for suffix_group in reversed(_stem_suffixes):
length = len(suffix_group[0])
if len(string) <= length:
break
for suffix in suffix_group:
if string.endswith(suffix):
return string[:-length]
return string
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(", ", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
if text.lower() in _num_words:
return True
return False
LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
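
A short sketch of how the new `like_num` lexical attribute behaves (the example strings are illustrative):

```python
from spacy.lang.ne.lex_attrs import like_num

print(like_num("चार"))   # True: listed in _num_words
print(like_num("१२३"))   # True: Devanagari digits satisfy str.isdigit()
print(like_num("३/४"))   # True: simple fractions with numeric parts
print(like_num("घर"))    # False: not number-like
```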

spacy/lang/ne/stop_words.py (new file, 498 lines)

@ -0,0 +1,498 @@
# coding: utf8
from __future__ import unicode_literals
# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
STOP_WORDS = set(
"""
अकसर
अगि
अग
अघि
अझ
अठ
अथव
अनि
अन
अनतरगत
अन
अनयत
अनयथ
अब
अर
अर
अर
अर
अर
अर
अलग
अलि
अवस
अहि
आए
आएक
आएक
आज
आजक
आठ
आत
आदि
आदि
आफन
आफ
आफ
आफ
आफ
आफ
आफ
आय
उक
उदहरण
उनक
उनल
उनल
उनि
उन
उनहर
उनइस
उप
उसक
उसल
उसल
उह
एउट
एउट
एक
एकदम
एघ
ओठ
कत
कति
कत
कम
कमसकम
कसरि
कसर
कस
कस
कस
कस
कस
कस
कह
कहि
रण
ि
ि
िनभन
पय
ि
ि
िपनि
पनि
रमश
गए
गएक
गएर
गय
गरि
गर
गर
गर
गर
गर
गर
गर
गरछन
गर
गर
गर
गर
गर
गरपर
गर
घर
हन
हन
ि
ि
ि
छन
छन
नन
जत
जततत
जन
जन
जन
जन
जब
जबकि
जबक
जसक
जसब
जसम
जसर
जसल
जसल
जस
जस
जस
जस
जह
ि
पनि
पन
तत
तत
तथ
तथि
तथ
तदन
तप
तप
तपईक
तब
तर
तर
तल
तसर
पनि
पन
ि
िि
ििहर
ि
िहर
िहर
िहर
िहर
ि
ि
ि
िरक
रन
रण
पनि
पन
यति
यति
यस
यसकरण
यसक
यसल
यस
यस
यस
यस
यह
यहि
यह
यह
यह
सपछि
थप
थरि
थर
ि
ि
िएन
ि
दर
दश
ि
िएक
ि
िभएक
ि
इवट
ि
ि
ि
धन
नगर
नगर
नजि
नत
नतरभन
नभई
नभएक
नभन
नय
ि
ि
िि
ि
ि
िि
पक
पक
पछि
पछ
पछि
पछि
पछ
पटक
पनि
पन
पर
पर
पर
पर
पर
पर
पहि
पहि
पहि
ि
रति
रत
रतयक
लस
फरक
ि
बढ
बत
बन
बर
ि
ि
िचम
ि
ि
ि
चम
भए
भए
भएक
भएक
भएक
भएन
भएर
भन
भन
भन
भन
भन
भनछन
भन
भन
भन
भनभय
भन
भन
भय
भय
भर
भरि
भर
ि
ि
मध
मध
मल
ि
ि
ि
यति
यथि
यदि
यदयपि
यदयपि
यस
यसक
यसक
यसपछि
यसब
यसम
यसर
यसल
यस
यस
यस
यह
यहसम
यह
रह
रह
रह
रह
पम
लगभग
लगयत
ि
वट
वरपर
पत
तवम
यद
सक
सक
गक
गस
सङ
सङगक
सट
सत
सध
सब
सब
सब
समय
सम
समभव
सम
सय
सरह
सहि
सहि
सह
यद
ि
पष
हज
हर
हर
नत
इन
ि
""".split()
)


@ -349,7 +349,7 @@ cdef class Lexeme:
@property
def is_oov(self):
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
return self.orth in self.vocab.vectors
return self.orth not in self.vocab.vectors
property is_stop:
"""RETURNS (bool): Whether the lexeme is a stop word."""


@ -528,10 +528,10 @@ class Tagger(Pipe):
new_tag_map[tag] = orig_tag_map[tag]
else:
new_tag_map[tag] = {POS: X}
if "_SP" in orig_tag_map:
new_tag_map["_SP"] = orig_tag_map["_SP"]
cdef Vocab vocab = self.vocab
if new_tag_map:
if "_SP" in orig_tag_map:
new_tag_map["_SP"] = orig_tag_map["_SP"]
vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer,
exc=vocab.morphology.exc)


@ -170,6 +170,11 @@ def nb_tokenizer():
return get_lang_class("nb").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def ne_tokenizer():
return get_lang_class("ne").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
def nl_tokenizer():
return get_lang_class("nl").Defaults.create_tokenizer()


@ -4,7 +4,7 @@ from __future__ import unicode_literals
import pytest
from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
from spacy.lang.ja import Japanese
from spacy.lang.ja import Japanese, DetailedToken
# fmt: off
TOKENIZER_TESTS = [
@ -96,6 +96,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
assert len(nlp_c(text)) == len_c
@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
[
(
"選挙管理委員会",
[None, None, None, None],
[None, None, [
[
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
DetailedToken(surface='', tag='名詞-普通名詞-一般', inf='', lemma='', reading='カイ', sub_tokens=None),
]
]],
[[
[
DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
DetailedToken(surface='', tag='名詞-普通名詞-一般', inf='', lemma='', reading='カイ', sub_tokens=None),
], [
DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
]
]]
),
]
)
def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
@pytest.mark.parametrize("text,inflections,reading_forms",
[
(
"取ってつけた",
("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
("トッ", "", "ツケ", ""),
),
]
)
def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
assert ja_tokenizer(text).user_data["inflections"] == inflections
assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
doc = ja_tokenizer("")
assert len(doc) == 0


@ -0,0 +1,19 @@
# coding: utf-8
from __future__ import unicode_literals
import pytest
def test_ne_tokenizer_handlers_long_text(ne_tokenizer):
text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
tokens = ne_tokenizer(text)
assert len(tokens) == 24
@pytest.mark.parametrize(
"text,length",
[("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
)
def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
tokens = ne_tokenizer(text)
assert len(tokens) == length


@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest
from spacy.language import Language
from spacy.symbols import POS, NOUN
def test_label_types():
@ -11,3 +12,16 @@ def test_label_types():
nlp.get_pipe("tagger").add_label("A")
with pytest.raises(ValueError):
nlp.get_pipe("tagger").add_label(9)
def test_tagger_begin_training_tag_map():
"""Test that Tagger.begin_training() without gold tuples does not clobber
the tag map."""
nlp = Language()
tagger = nlp.create_pipe("tagger")
orig_tag_count = len(tagger.labels)
tagger.add_label("A", {"POS": "NOUN"})
nlp.add_pipe(tagger)
nlp.begin_training()
assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)


@ -376,6 +376,6 @@ def test_vector_is_oov():
data[1] = 2.0
vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1])
assert vocab["cat"].is_oov is True
assert vocab["dog"].is_oov is True
assert vocab["hamster"].is_oov is False
assert vocab["cat"].is_oov is False
assert vocab["dog"].is_oov is False
assert vocab["hamster"].is_oov is True


@ -923,7 +923,7 @@ cdef class Token:
@property
def is_oov(self):
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
return self.c.lex.orth in self.vocab.vectors
return self.c.lex.orth not in self.vocab.vectors
@property
def is_stop(self):


@ -208,6 +208,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
pipeline = nlp.Defaults.pipe_names
elif pipeline in (False, None):
pipeline = []
# skip "vocab" from overrides in component initialization since vocab is
# already configured from overrides when nlp is initialized above
if "vocab" in overrides:
del overrides["vocab"]
for name in pipeline:
if name not in disable:
config = meta.get("pipeline_args", {}).get(name, {})


@ -12,18 +12,18 @@ expects true examples of a label to have the value `1.0`, and negative examples
of a label to have the value `0.0`. Labels not in the dictionary are treated as
missing; the gradient for those labels will be zero.
| Name | Type | Description |
| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The document the annotations refer to. |
| `words` | iterable | A sequence of unicode word strings. |
| `tags` | iterable | A sequence of strings, representing tag annotations. |
| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False.`. |
| **RETURNS** | `GoldParse` | The newly constructed object. |
| Name | Type | Description |
| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `doc` | `Doc` | The document the annotations refer to. |
| `words` | iterable | A sequence of unicode word strings. |
| `tags` | iterable | A sequence of strings, representing tag annotations. |
| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
| **RETURNS** | `GoldParse` | The newly constructed object. |
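
For the `cats` values described above, a minimal sketch (the label names and example text are illustrative):

```python
import spacy
from spacy.gold import GoldParse

nlp = spacy.blank("en")
doc = nlp("This is a text about dogs.")
# "POSITIVE" is an explicit true label, "NEGATIVE" an explicit false one;
# any label missing from the dict is treated as a missing value with zero gradient
gold = GoldParse(doc, cats={"POSITIVE": 1.0, "NEGATIVE": 0.0})
```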
## GoldParse.\_\_len\_\_ {#len tag="method"}
@ -43,17 +43,17 @@ Whether the provided syntactic annotations form a projective dependency tree.
## Attributes {#attributes}
| Name | Type | Description |
| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `words` | list | The words. |
| `tags` | list | The part-of-speech tag annotations. |
| `heads` | list | The syntactic head annotations. |
| `labels` | list | The syntactic relation-type annotations. |
| `ner` | list | The named entity annotations as BILUO tags. |
| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
| `cats` <Tag variant="new">2</Tag> | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. |
| Name | Type | Description |
| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ |
| `words` | list | The words. |
| `tags` | list | The part-of-speech tag annotations. |
| `heads` | list | The syntactic head annotations. |
| `labels` | list | The syntactic relation-type annotations. |
| `ner` | list | The named entity annotations as BILUO tags. |
| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
| `cats` <Tag variant="new">2</Tag> | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
| `links` <Tag variant="new">2.2</Tag> | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. |
## Utilities {#util}
@ -61,7 +61,8 @@ Whether the provided syntactic annotations form a projective dependency tree.
Convert a list of Doc objects into the
[JSON-serializable format](/api/annotation#json-input) used by the
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
'paragraph' in the output doc.
> #### Example
>


@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects.
| Name | Type | Description |
| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).. |
| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. |
<Infobox title="Important note" variant="warning">


@ -36,7 +36,7 @@ for token in doc:
| Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
| ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- |
| Apple | apple | `PROPN` | `NNP` | `nsubj` | `Xxxxx` | `True` | `False` |
| is | be | `VERB` | `VBZ` | `aux` | `xx` | `True` | `True` |
| is | be | `AUX` | `VBZ` | `aux` | `xx` | `True` | `True` |
| looking | look | `VERB` | `VBG` | `ROOT` | `xxxx` | `True` | `False` |
| at | at | `ADP` | `IN` | `prep` | `xx` | `True` | `True` |
| buying | buy | `VERB` | `VBG` | `pcomp` | `xxxx` | `True` | `False` |


@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole
documents**, not just single sentences. If your corpus only contains single
sentences, spaCy's models will never learn to expect multi-sentence documents,
leading to low performance on real text. To mitigate this problem, you can use
the `-N` argument to the `spacy convert` command, to merge some of the sentences
the `-n` argument to the `spacy convert` command, to merge some of the sentences
into longer pseudo-documents.
### Training the tagger and parser {#train-tagger-parser}


@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents) # []
header = [ENT_IOB, ENT_TYPE]
attr_array = numpy.zeros((len(doc), len(header)))
attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3 # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
@ -1143,9 +1143,9 @@ from spacy.gold import align
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
print("Misaligned tokens:", cost) # 2
print("Edit distance:", cost) # 3
print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6])
print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, 5, 6, 7])
print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7])
print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4}
print("Many-to-one mappings b-> a", b2a_multi) # {}
```
@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi) # {}
Here are some insights from the alignment information generated in the example
above:
- Two tokens are misaligned.
- The edit distance (cost) is `3`: two deletions and one insertion.
- The one-to-one mappings for the first four tokens are identical, which means
they map to each other. This makes sense because they're also identical in the
input: `"i"`, `"listened"`, `"to"` and `"obama"`.


@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
better segmentation for Chinese OntoNotes and the new
[Chinese models](/models/zh).
<Infobox variant="warning">
Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
install it from our fork and compile it locally:
```bash
$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
```
</Infobox>
<Accordion title="Details on spaCy's PKUSeg API">
The `meta` argument of the `Chinese` language class supports the following
@ -196,12 +208,20 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo
The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. The default Japanese language class
and the provided Japanese models use SudachiPy split mode `A`.
segmentation and part-of-speech tagging. The default Japanese language class and
the provided Japanese models use SudachiPy split mode `A`.
The `meta` argument of the `Japanese` language class can be used to configure
the split mode to `A`, `B` or `C`.
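For example, a minimal sketch (assuming `split_mode` is the config key used by
the v2.3 Japanese tokenizer):

```python
from spacy.lang.ja import Japanese

# switch SudachiPy from the default split mode A to B
nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
```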
<Infobox variant="warning">
If you run into errors related to `sudachipy`, which is currently under active
development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
used for training the current [Japanese models](/models/ja).
</Infobox>
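The downgrade itself is a single pin:

```bash
$ pip install sudachipy==0.4.5
```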
## Installing and using models {#download}
> #### Downloading models in spaCy < v1.7


@ -1158,17 +1158,17 @@ what you need for your application.
> available corpus.
For example, the corpus spaCy's [English models](/models/en) were trained on
defines a `PERSON` entity as just the **person name**, without titles like "Mr"
or "Dr". This makes sense, because it makes it easier to resolve the entity type
back to a knowledge base. But what if your application needs the full names,
_including_ the titles?
defines a `PERSON` entity as just the **person name**, without titles like "Mr."
or "Dr.". This makes sense, because it makes it easier to resolve the entity
type back to a knowledge base. But what if your application needs the full
names, _including_ the titles?
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```
@ -1233,7 +1233,7 @@ def expand_person_entities(doc):
# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')
doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```


@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2&times;** smaller on disk and load
**2-4&times;** faster. For the full changelog, see the [release notes on
GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
details and a behind-the-scenes look at the new release, [see our blog
post](https://explosion.ai/blog/spacy-v2-3).
**2-4&times;** faster. For the full changelog, see the
[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
For more details and a behind-the-scenes look at the new release,
[see our blog post](https://explosion.ai/blog/spacy-v2-3).
### Expanded model families with vectors {#models}
@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).
With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
a total of 46 model packages. For models trained using [Universal
Dependencies](https://universaldependencies.org) corpora, the training data has
been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
extended to include both UD Dutch Alpino and LassySmall.
a total of 46 model packages. For models trained using
[Universal Dependencies](https://universaldependencies.org) corpora, the
training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
<Infobox>
@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
### Chinese {#chinese}
> #### Example
>
> ```python
> from spacy.lang.zh import Chinese
>
@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
> ```
This release adds support for
[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
The Chinese tokenizer can be initialized with both `pkuseg` and custom models
and the `pkuseg` user dictionary is easy to customize.
[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
Chinese tokenizer can be initialized with both `pkuseg` and custom models and
the `pkuseg` user dictionary is easy to customize. Note that
[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
pre-compiled wheels for Python 3.8. See the
[usage documentation](/usage/models#chinese) for details on how to install it on
Python 3.8.
<Infobox>
**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)
**Models:** [Chinese models](/models/zh) **Usage:**
[Chinese tokenizer usage](/usage/models#chinese)
</Infobox>
### Japanese {#japanese}
The updated Japanese language class switches to
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.
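For example (a hedged sketch; quoting the extra may be needed depending on your
shell, and the model name assumes the v2.3 Japanese packages):

```bash
$ pip install "spacy[ja]"
$ python -m spacy download ja_core_news_sm
```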
<Infobox>
**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)
**Models:** [Japanese models](/models/ja) **Usage:**
[Japanese tokenizer usage](/usage/models#japanese)
</Infobox>
### Small CLI updates
- `spacy debug-data` provides the coverage of the vectors in a base model with
`spacy debug-data lang train dev -b base_model`
- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
dev.json`) to evaluate the tokenization accuracy without loading a model
- `spacy train` on GPU restricts the CPU timing evaluation to the first
iteration
- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
in a base model with `spacy debug-data lang train dev -b base_model` (see the
sketch after this list)
- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
`spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
without loading a model
- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
the first iteration
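A hedged sketch of these commands (language code, file names and base model are
placeholders):

```bash
# report vector coverage of a base model while debugging training data
$ python -m spacy debug-data en train.json dev.json -b en_core_web_md
# evaluate tokenization accuracy with a blank English pipeline
$ python -m spacy evaluate blank:en dev.json
```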
## Backwards incompatibilities {#incompat}
@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
with models for v2.3. To check if all of your models are up to date, you can
run the [`spacy validate`](/api/cli#validate) command.
with models for v2.3. To check if all of your models are up to date, you can run
the [`spacy validate`](/api/cli#validate) command.
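For example:

```bash
$ python -m spacy validate
```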
</Infobox>
@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
> directly.
- If you're training new models, you'll want to install the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
which now includes both the lemmatization tables (as in v2.2) and the
normalization tables (new in v2.3). If you're using pretrained models,
**nothing changes**, because the relevant tables are included in the model
packages.
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) (the
install command is shown after this list), which
now includes both the lemmatization tables (as in v2.2) and the normalization
tables (new in v2.3). If you're using pretrained models, **nothing changes**,
because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained
part-of-speech tags will change for many provided language models. The
coarse-grained part-of-speech tagset remains the same, but the mapping from
particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
tagsets contain new merged tags related to contracted forms, such as
`ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
`"à"`. This increases the accuracy of the models by improving the alignment
between spaCy's tokenization and Universal Dependencies multi-word tokens
used for contractions.
tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
increases the accuracy of the models by improving the alignment between
spaCy's tokenization and Universal Dependencies multi-word tokens used for
contractions.
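Installing the lookups package mentioned above is a single command:

```bash
$ pip install spacy-lookups-data
```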
### Migrating from spaCy 2.2 {#migrating}
@ -143,29 +151,81 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
a comma at the end of a URL) before applying the match. See the full [tokenizer
documentation](/usage/linguistic-features#tokenization) and try out
cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
comma at the end of a URL) before applying the match. See the full
[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
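For instance, a minimal debugging sketch (assuming the small English model is
installed; the URL is illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# each tuple pairs the rule that produced the token (e.g. PREFIX, SUFFIX,
# URL_MATCH) with the token text
for rule, token_text in nlp.tokenizer.explain("Check https://spacy.io/usage,"):
    print(rule, token_text)
```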
#### Warnings configuration
spaCy's custom warnings have been replaced with native python
spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
setting `SPACY_WARNING_IGNORE`, use the [warnings
setting `SPACY_WARNING_IGNORE`, use the [`warnings`
filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
```diff
import spacy
+ import warnings
- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning)
```
#### Normalization tables
The normalization tables have moved from the language data in
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
the package
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
you're adding data for a new language, the normalization table should be added
to `spacy-lookups-data`. See [adding norm
exceptions](/usage/adding-languages#norm-exceptions).
[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
If you're adding data for a new language, the normalization table should be
added to `spacy-lookups-data`. See
[adding norm exceptions](/usage/adding-languages#norm-exceptions).
#### No preloaded lexemes/vocab for models with vectors
To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
loaded on initialization for models with vectors. As you process texts, the
lexemes will be added to the vocab automatically, just as in models without
vectors.
To see the number of unique vectors and the number of words with vectors, check
`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
and `684830` words with vectors:
```python
{
'width': 300,
'vectors': 20000,
'keys': 684830,
'name': 'en_core_web_md.vectors'
}
```
If required, for instance if you are working directly with word vectors rather
than processing texts, you can load all lexemes for words with vectors at once:
```python
for orth in nlp.vocab.vectors:
_ = nlp.vocab[orth]
```
#### Lexeme.is_oov and Token.is_oov
<Infobox title="Important note" variant="warning">
Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
fixed in the next patch release v2.3.1.
</Infobox>
In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
have a word vector. This is equivalent to `token.orth not in
nlp.vocab.vectors`.
Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
probability and cluster features. The probability and cluster features are no
longer included in the provided medium and large models (see the next section).
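A quick way to check the corrected behavior in v2.3.1+ (a minimal sketch,
assuming a model with vectors such as `en_core_web_md` is installed):

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Apple qwertyplonk")
for token in doc:
    # with the fix, is_oov is True exactly when the lexeme has no word vector
    print(token.text, token.is_oov, token.orth not in nlp.vocab.vectors)
```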
#### Probability and cluster features
@ -181,28 +241,28 @@ exceptions](/usage/adding-languages#norm-exceptions).
The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
pretrained models to reduce the model size. To keep these features available
for users relying on them, the `prob` and `cluster` features for the most
frequent 1M tokens have been moved to
pretrained models to reduce the model size. To keep these features available for
users relying on them, the `prob` and `cluster` features for the most frequent
1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).
The extra tables are loaded lazily, so if you have `spacy-lookups-data`
installed and your code accesses `Token.prob`, the full table is loaded into
the model vocab, which will take a few seconds on initial loading. When you
save this model after loading the `prob` table, the full `prob` table will be
saved as part of the model vocab.
installed and your code accesses `Token.prob`, the full table is loaded into the
model vocab, which will take a few seconds on initial loading. When you save
this model after loading the `prob` table, the full `prob` table will be saved
as part of the model vocab.
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
part of a new model, add the data to
If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
currently only used to provide a custom `oov_prob`. See examples in the [`data`
directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
currently only used to provide a custom `oov_prob`. See examples in the
[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
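A rough sketch of the `Vocab` route (the table contents here are made-up
placeholder values, not real frequencies or clusters):

```python
from spacy.vocab import Vocab
from spacy.lookups import Lookups

# toy tables for illustration; real tables ship in spacy-lookups-data
lookups = Lookups()
lookups.add_table("lexeme_prob", {"apple": -10.0})
lookups.add_table("lexeme_cluster", {"apple": 8})
lookups.add_table("lexeme_settings", {"oov_prob": -20.0})

vocab = Vocab(lookups_extra=lookups)
```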
#### Initializing new models without extra lookups tables


@ -23,9 +23,9 @@
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
"indexName": "spacy"
},
"binderUrl": "ines/spacy-io-binder",
"binderUrl": "explosion/spacy-io-binder",
"binderBranch": "live",
"binderVersion": "2.2.0",
"binderVersion": "2.3.0",
"sections": [
{ "id": "usage", "title": "Usage Documentation", "theme": "blue" },
{ "id": "models", "title": "Models Documentation", "theme": "blue" },