Update v2.3.x branch (#5636)

* Fix typos and auto-format [ci skip]
* Add pkuseg warnings and auto-format [ci skip]
* Update Binder URL [ci skip]
* Update Binder version [ci skip]
* Update alignment example for new gold.align
* Update POS in tagging example
* Fix numpy.zeros() dtype for Doc.from_array
* Change example title to Dr.

  Change example title to Dr. so the current model does exclude the title in the initial example.
* Fix spacy convert argument
* Warning for sudachipy 0.4.5 (#5611)
* Create myavrum.md (#5612)
* Update lex_attrs.py (#5608)
* Create mahnerak.md (#5615)
* Some changes for Armenian (#5616)
  * Fixing numericals
  * We need a Armenian question sign to make the sentence a question
* Add Nepali Language (#5622)
  * added support for nepali lang
  * added examples and test files
  * added spacy contributor agreement
* Japanese model: add user_dict entries and small refactor (#5573)
  * user_dict fields: adding inflections, reading_forms, sub_tokens
    deleting: unidic_tags
    improve code readability around the token alignment procedure
  * add test cases, replace fugashi with sudachipy in conftest
  * move bunsetu.py to spaCy Universe as a pipeline component BunsetuRecognizer
  * tag is space -> both surface and tag are spaces
  * consider len(text)==0
* Add warnings example in v2.3 migration guide (#5627)
* contribute (#5632)
* Fix polarity of Token.is_oov and Lexeme.is_oov (#5634)

  Fix `Token.is_oov` and `Lexeme.is_oov` so they return `True` when the lexeme does **not** have a vector.
* Extend what's new in v2.3 with vocab / is_oov (#5635)
* Skip vocab in component config overrides (#5624)
* Fix backslashes in warnings config diff (#5640)

  Fix backslashes in warnings config diff in v2.3 migration section.
* Disregard special tag _SP in check for new tag map (#5641)
  * Skip special tag _SP in check for new tag map

    In `Tagger.begin_training()` check for new tags aside from `_SP` in the new tag map initialized from the provided gold tuples when determining whether to reinitialize the morphology with the new tag map.
  * Simplify _SP check

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Marat M. Yavrumyan <myavrum@ysu.am>
Co-authored-by: Karen Hambardzumyan <mahnerak@gmail.com>
Co-authored-by: Rameshh <30867740+rameshhpathak@users.noreply.github.com>
Co-authored-by: Hiroshi Matsuda <40782025+hiroshi-matsuda-rit@users.noreply.github.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2025-11-18 00:35:50 +03:00 · 2020-06-29 14:13:12 +02:00 · 2020-06-29 14:13:12 +02:00 · f42c9026f5
commit f42c9026f5
parent e9d3e177f0
31 changed files with 1458 additions and 365 deletions
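
The commit message states that `Token.is_oov` and `Lexeme.is_oov` should return `True` when the lexeme does not have a vector. The intended polarity can be sketched with a toy example. This is not spaCy's implementation: the `vectors` dict and the `is_oov` helper below are hypothetical stand-ins used only to show which way the flag points.

```python
# Toy vector table; the words and values are made up for illustration.
vectors = {"apple": [0.1, 0.3], "pear": [0.2, 0.4]}


def is_oov(word, table):
    """Return True when `word` has no vector in `table` (out-of-vocabulary)."""
    return word not in table


print(is_oov("apple", vectors))  # False: a vector exists
print(is_oov("blorp", vectors))  # True: no vector, so out-of-vocabulary
```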
--- a/.github/contributors/mahnerak.md
+++ b/.github/contributors/mahnerak.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Karen Hambardzumyan  |
 | Company name (if applicable)   | YerevaNN             |
 | Title or role (if applicable)  | Researcher           |
 | Date                           | 2020-06-19           |
 | GitHub username                | mahnerak             |
 | Website (optional)             | https://mahnerak.com/|
--- a/.github/contributors/myavrum.md
+++ b/.github/contributors/myavrum.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Marat M. Yavrumyan   |
 | Company name (if applicable)   | YSU, UD_Armenian Project |
 | Title or role (if applicable)  | Dr., Principal Investigator |
 | Date                           | 2020-06-19           |
 | GitHub username                | myavrum              |
 | Website (optional)             | http://armtreebank.yerevann.com/ |
--- a/.github/contributors/rameshhpathak.md
+++ b/.github/contributors/rameshhpathak.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Ramesh Pathak        |
 | Company name (if applicable)   | Diyo AI              |
 | Title or role (if applicable)  | AI Engineer          |
 | Date                           | June 21, 2020        |
 | GitHub username                | rameshhpathak        |
 | Website (optional)             |rameshhpathak.github.io|                      |
--- a/.github/contributors/richardliaw.md
+++ b/.github/contributors/richardliaw.md
@@ -0,0 +1,106 @@
 # spaCy contributor agreement
 This spaCy Contributor Agreement (**"SCA"**) is based on the
 [Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
 The SCA applies to any contribution that you make to any product or project
 managed by us (the **"project"**), and sets out the intellectual property rights
 you grant to us in the contributed materials. The term **"us"** shall mean
 [ExplosionAI GmbH](https://explosion.ai/legal). The term
 **"you"** shall mean the person or entity identified below.
 If you agree to be bound by these terms, fill in the information requested
 below and include the filled-in version with your first pull request, under the
 folder [`.github/contributors/`](/.github/contributors/). The name of the file
 should be your GitHub username, with the extension `.md`. For example, the user
 example_user would create the file `.github/contributors/example_user.md`.
 Read this agreement carefully before signing. These terms and conditions
 constitute a binding legal agreement.
 ## Contributor Agreement
 1. The term "contribution" or "contributed materials" means any source code,
 object code, patch, tool, sample, graphic, specification, manual,
 documentation, or any other material posted or submitted by you to the project.
 2. With respect to any worldwide copyrights, or copyright applications and
 registrations, in your contribution:
    * you hereby assign to us joint ownership, and to the extent that such
    assignment is or becomes invalid, ineffective or unenforceable, you hereby
    grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
    royalty-free, unrestricted license to exercise all rights under those
    copyrights. This includes, at our option, the right to sublicense these same
    rights to third parties through multiple levels of sublicensees or other
    licensing arrangements;
    * you agree that each of us can do all things in relation to your
    contribution as if each of us were the sole owners, and if one of us makes
    a derivative work of your contribution, the one who makes the derivative
    work (or has it made will be the sole owner of that derivative work;
    * you agree that you will not assert any moral rights in your contribution
    against us, our licensees or transferees;
    * you agree that we may register a copyright in your contribution and
    exercise all ownership rights associated with it; and
    * you agree that neither of us has any duty to consult with, obtain the
    consent of, pay or render an accounting to the other for any use or
    distribution of your contribution.
 3. With respect to any patents you own, or that you can license without payment
 to any third party, you hereby grant to us a perpetual, irrevocable,
 non-exclusive, worldwide, no-charge, royalty-free license to:
    * make, have made, use, sell, offer to sell, import, and otherwise transfer
    your contribution in whole or in part, alone or in combination with or
    included in any product, work or materials arising out of the project to
    which your contribution was submitted, and
    * at our option, to sublicense these same rights to third parties through
    multiple levels of sublicensees or other licensing arrangements.
 4. Except as set out above, you keep all right, title, and interest in your
 contribution. The rights that you grant to us under these terms are effective
 on the date you first submitted a contribution to us, even if your submission
 took place before the date you sign these terms.
 5. You covenant, represent, warrant and agree that:
    * Each contribution that you submit is and shall be an original work of
    authorship and you can legally grant the rights set out in this SCA;
    * to the best of your knowledge, each contribution will not violate any
    third party's copyrights, trademarks, patents, or other intellectual
    property rights; and
    * each contribution shall be in compliance with U.S. export control laws and
    other applicable export and import laws. You agree to notify us if you
    become aware of any circumstance which would make any of the foregoing
    representations inaccurate in any respect. We may publicly disclose your
    participation in the project, including the fact that you have signed the SCA.
 6. This SCA is governed by the laws of the State of California and applicable
 U.S. Federal law. Any choice of law rules will not apply.
 7. Please place an “x” on one of the applicable statement below. Please do NOT
 mark both statements:
    * [x] I am signing on behalf of myself as an individual and no other person
    or entity, including my employer, has or will have rights with respect to my
    contributions.
    * [ ] I am signing on behalf of my employer or a legal entity and I have the
    actual authority to contractually bind that entity.
 ## Contributor Details
 | Field                          | Entry                |
 |------------------------------- | -------------------- |
 | Name                           | Richard Liaw         |
 | Company name (if applicable)   |                      |
 | Title or role (if applicable)  |                      |
 | Date                           | 06/22/2020           |
 | GitHub username                | richardliaw          |
 | Website (optional)             |                      |
--- a/spacy/lang/hy/examples.py
+++ b/spacy/lang/hy/examples.py
@@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models.
 sentences = [
    "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
    "Ո՞վ է Ֆրանսիայի նախագահը։",
-    "Որն է Միացյալ Նահանգների մայրաքաղաքը։",
+    "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
    "Ե՞րբ է ծնվել Բարաք Օբաման։",
 ]
--- a/spacy/lang/hy/lex_attrs.py
+++ b/spacy/lang/hy/lex_attrs.py
@@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM
 _num_words = [
-    "զրօ",
+    "զրո",
-    "մէկ",
+    "մեկ",
    "երկու",
    "երեք",
    "չորս",
@@ -18,20 +18,21 @@ _num_words = [
    "տասը",
    "տասնմեկ",
    "տասներկու",
-    "տասներեք",
+    "տասներեք",
-    "տասնչորս",
+    "տասնչորս",
-    "տասնհինգ",
+    "տասնհինգ",
-    "տասնվեց",
+    "տասնվեց",
-    "տասնյոթ",
+    "տասնյոթ",
-    "տասնութ",
+    "տասնութ",
-    "տասնինը",
+    "տասնինը",
-    "քսան" "երեսուն",
+    "քսան",
    "երեսուն",
    "քառասուն",
    "հիսուն",
-    "վաթցսուն",
+    "վաթսուն",
    "յոթանասուն",
    "ութսուն",
-    "ինիսուն",
+    "իննսուն",
    "հարյուր",
    "հազար",
    "միլիոն",
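
The `"քսան" "երեսուն",` line removed in this hunk shows why a missing comma matters: Python joins adjacent string literals into one string, so the old list held a single merged entry instead of a separate entry for "քսան". A minimal sketch of that behaviour, using shortened lists whose names (`buggy`, `fixed`) are illustrative only:

```python
# Adjacent string literals are concatenated at compile time, so a missing
# comma silently merges two list items into one.
buggy = [
    "քսան" "երեսուն",  # missing comma: becomes the single string "քսաներեսուն"
    "քառասուն",
]

fixed = [
    "քսան",
    "երեսուն",
    "քառասուն",
]

print(len(buggy), len(fixed))
print("քսան" in buggy, "քսան" in fixed)
```

With the comma restored, a membership test against the list (as a `like_num`-style check would do) can match "քսան" again.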
--- a/spacy/lang/ja/__init__.py
+++ b/spacy/lang/ja/__init__.py
@@ -20,12 +20,7 @@ from ... import util
 # Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
+DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"])
 # Handling for multiple spaces in a row is somewhat awkward, this simplifies
 # the flow by creating a dummy with the same interface.
 DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
 DummySpace = DummyNode(" ", " ", " ")
 def try_sudachi_import(split_mode="A"):
@@ -53,7 +48,7 @@ def try_sudachi_import(split_mode="A"):
        )
-def resolve_pos(orth, pos, next_pos):
+def resolve_pos(orth, tag, next_tag):
    """If necessary, add a field to the POS tag for UD mapping.
    Under Universal Dependencies, sometimes the same Unidic POS tag can
    be mapped differently depending on the literal token or its context
@@ -64,124 +59,77 @@ def resolve_pos(orth, pos, next_pos):
    # Some tokens have their UD tag decided based on the POS of the following
    # token.
-    # orth based rules
+    # apply orth based mapping
-    if pos[0] in TAG_ORTH_MAP:
+    if tag in TAG_ORTH_MAP:
-        orth_map = TAG_ORTH_MAP[pos[0]]
+        orth_map = TAG_ORTH_MAP[tag]
        if orth in orth_map:
-            return orth_map[orth], None
+            return orth_map[orth], None  # current_pos, next_pos
-    # tag bi-gram mapping
+    # apply tag bi-gram mapping
-    if next_pos:
+    if next_tag:
-        tag_bigram = pos[0], next_pos[0]
+        tag_bigram = tag, next_tag
        if tag_bigram in TAG_BIGRAM_MAP:
-            bipos = TAG_BIGRAM_MAP[tag_bigram]
+            current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
-            if bipos[0] is None:
+            if current_pos is None:  # apply tag uni-gram mapping for current_pos
-                return TAG_MAP[pos[0]][POS], bipos[1]
+                return TAG_MAP[tag][POS], next_pos  # only next_pos is identified by tag bi-gram mapping
            else:
-                return bipos
+                return current_pos, next_pos
-    return TAG_MAP[pos[0]][POS], None
+    # apply tag uni-gram mapping
    return TAG_MAP[tag][POS], None
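
The lookup order in the refactored `resolve_pos` (orth-based mapping first, then tag bi-gram mapping, then the uni-gram fallback) can be tried out in isolation. The sketch below is a simplified stand-in, not the code from this commit: the three tables contain made-up toy tags, and `TAG_MAP` maps straight to a POS string instead of a dict keyed by `POS`.

```python
# Toy tables; every key and value here is invented for illustration.
TAG_MAP = {"N": "NOUN", "V": "VERB", "P": "ADP"}
TAG_ORTH_MAP = {"P": {"no": "PART"}}
TAG_BIGRAM_MAP = {
    ("N", "V"): ("VERB", "AUX"),   # both current and next POS are decided
    ("V", "P"): (None, "SCONJ"),   # only the next POS is decided
}


def resolve_pos(orth, tag, next_tag):
    """Return (current_pos, next_pos); next_pos is None when undecided."""
    # 1. orth-based mapping
    if tag in TAG_ORTH_MAP and orth in TAG_ORTH_MAP[tag]:
        return TAG_ORTH_MAP[tag][orth], None
    # 2. tag bi-gram mapping
    if next_tag and (tag, next_tag) in TAG_BIGRAM_MAP:
        current_pos, next_pos = TAG_BIGRAM_MAP[(tag, next_tag)]
        if current_pos is None:
            # fall back to the uni-gram table for the current token
            return TAG_MAP[tag], next_pos
        return current_pos, next_pos
    # 3. tag uni-gram mapping
    return TAG_MAP[tag], None


print(resolve_pos("no", "P", None))
print(resolve_pos("x", "N", "V"))
print(resolve_pos("x", "V", "P"))
print(resolve_pos("x", "N", None))
```

Returning the next token's POS alongside the current one lets the caller carry it into the following iteration, which is how the `next_pos` variable in `JapaneseTokenizer.__call__` is used.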
-# Use a mapping of paired punctuation to avoid splitting quoted sentences.
+def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
-pairpunct = {'「':'」', '『': '』', '【': '】'}
+    # Compare the content of tokens and text, first
 def separate_sentences(doc):
    """Given a doc, mark tokens that start sentences based on Unidic tags.
    """
    stack = [] # save paired punctuation
    for i, token in enumerate(doc[:-2]):
        # Set all tokens after the first to false by default. This is necessary
        # for the doc code to be aware we've done sentencization, see
        # `is_sentenced`.
        token.sent_start = (i == 0)
        if token.tag_:
            if token.tag_ == "補助記号-括弧開":
                ts = str(token)
                if ts in pairpunct:
                    stack.append(pairpunct[ts])
                elif stack and ts == stack[-1]:
                    stack.pop()
            if token.tag_ == "補助記号-句点":
                next_token = doc[i+1]
                if next_token.tag_ != token.tag_ and not stack:
                    next_token.sent_start = True
 def get_dtokens(tokenizer, text):
    tokens = tokenizer.tokenize(text)
    words = []
    for ti, token in enumerate(tokens):
        tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*'])
        inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*'])
        dtoken = DetailedToken(
                token.surface(),
                (tag, inf),
                token.dictionary_form())
        if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白':
            # don't add multiple space tokens in a row
            continue
        words.append(dtoken)
    # remove empty tokens. These can be produced with characters like … that
    # Sudachi normalizes internally. 
    words = [ww for ww in words if len(ww.surface) > 0]
    return words
 def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
    words = [x.surface for x in dtokens]
    if "".join("".join(words).split()) != "".join(text.split()):
        raise ValueError(Errors.E194.format(text=text, words=words))
-    text_words = []
+
-    text_lemmas = []
+    text_dtokens = []
    text_tags = []
    text_spaces = []
    text_pos = 0
    # handle empty and whitespace-only texts
    if len(words) == 0:
-        return text_words, text_lemmas, text_tags, text_spaces
+        return text_dtokens, text_spaces
    elif len([word for word in words if not word.isspace()]) == 0:
        assert text.isspace()
-        text_words = [text]
+        text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)]
        text_lemmas = [text]
        text_tags = [gap_tag]
        text_spaces = [False]
-        return text_words, text_lemmas, text_tags, text_spaces
+        return text_dtokens, text_spaces
-    # normalize words to remove all whitespace tokens
+
-    norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()])
+    # align words and dtokens by referring text, and insert gap tokens for the space char spans
-    # align words with text
+    for word, dtoken in zip(words, dtokens):
-    for word, dtoken in zip(norm_words, norm_dtokens):
+        # skip all space tokens
        if word.isspace():
            continue
        try:
            word_start = text[text_pos:].index(word)
        except ValueError:
            raise ValueError(Errors.E194.format(text=text, words=words))
        # space token
        if word_start > 0:
            w = text[text_pos:text_pos + word_start]
-            text_words.append(w)
+            text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
            text_lemmas.append(w)
            text_tags.append(gap_tag)
            text_spaces.append(False)
            text_pos += word_start
-        text_words.append(word)
+
-        text_lemmas.append(dtoken.lemma)
+        # content word
-        text_tags.append(dtoken.pos)
+        text_dtokens.append(dtoken)
        text_spaces.append(False)
        text_pos += len(word)
        # poll a space char after the word
        if text_pos < len(text) and text[text_pos] == " ":
            text_spaces[-1] = True
            text_pos += 1
    # trailing space token
    if text_pos < len(text):
        w = text[text_pos:]
-        text_words.append(w)
+        text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
-        text_lemmas.append(w)
-        text_tags.append(gap_tag)
        text_spaces.append(False)
-    return text_words, text_lemmas, text_tags, text_spaces
+
    return text_dtokens, text_spaces
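
A minimal, dependency-free sketch of the alignment that `get_dtokens_and_spaces()` performs above, for tracing the logic without SudachiPy installed. The `DetailedToken` namedtuple below is a local stand-in with the six fields used in this diff, and `align_tokens` is a hypothetical name that is not part of spaCy. The sketch assumes the tokens appear in `text` in order and omits the empty-input branches and the `E194` error wrapping.

```python
from collections import namedtuple

# Local stand-in for the DetailedToken used in the diff (assumption: same six fields).
DetailedToken = namedtuple(
    "DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]
)


def align_tokens(dtokens, text, gap_tag="空白"):
    """Align tokens to text, inserting gap tokens for unmatched whitespace runs."""
    aligned, spaces, pos = [], [], 0
    for dtoken in dtokens:
        word = dtoken.surface
        if word.isspace():
            continue  # whitespace is re-derived from the text itself
        start = text.index(word, pos)  # raises ValueError if the token is not found
        if start > pos:
            # whitespace that was not absorbed as a trailing space becomes a gap token
            gap = text[pos:start]
            aligned.append(DetailedToken(gap, gap_tag, "", gap, None, None))
            spaces.append(False)
            pos = start
        aligned.append(dtoken)
        spaces.append(False)
        pos += len(word)
        # a single ASCII space right after the word is stored as trailing whitespace
        if pos < len(text) and text[pos] == " ":
            spaces[-1] = True
            pos += 1
    if pos < len(text):
        gap = text[pos:]
        aligned.append(DetailedToken(gap, gap_tag, "", gap, None, None))
        spaces.append(False)
    return aligned, spaces


tokens = [DetailedToken(s, "名詞", "", s, None, None) for s in ("東京", "タワー")]
aligned, spaces = align_tokens(tokens, "東京  タワー")
print([t.surface for t in aligned], spaces)
```

Joining each surface with a space wherever the matching `spaces` entry is `True` reproduces the input text, which is the invariant `Doc(vocab, words=words, spaces=spaces)` relies on.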
 class JapaneseTokenizer(DummyTokenizer):
@@ -191,29 +139,78 @@ class JapaneseTokenizer(DummyTokenizer):
        self.tokenizer = try_sudachi_import(self.split_mode)
    def __call__(self, text):
-        dtokens = get_dtokens(self.tokenizer, text)
+        # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces
        sudachipy_tokens = self.tokenizer.tokenize(text)
        dtokens = self._get_dtokens(sudachipy_tokens)
        dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
-        words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
+        # create Doc with tag bi-gram based part-of-speech identification rules
        words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6
        sub_tokens_list = list(sub_tokens_list)
        doc = Doc(self.vocab, words=words, spaces=spaces)
-        next_pos = None
+        next_pos = None  # for bi-gram rules
-        for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
+        for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
-            token.tag_ = unidic_tag[0]
+            token.tag_ = dtoken.tag
-            if next_pos:
+            if next_pos:  # already identified in previous iteration
                token.pos = next_pos
                next_pos = None
            else:
                token.pos, next_pos = resolve_pos(
                    token.orth_,
-                    unidic_tag,
+                    dtoken.tag,
-                    unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None
+                    tags[idx + 1] if idx + 1 < len(tags) else None
                )
            # if there's no lemma info (it's an unk) just use the surface
-            token.lemma_ = lemma
+            token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
-        doc.user_data["unidic_tags"] = unidic_tags
+
        doc.user_data["inflections"] = inflections
        doc.user_data["reading_forms"] = readings
        doc.user_data["sub_tokens"] = sub_tokens_list
        return doc
    def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
        sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
        dtokens = [
            DetailedToken(
                token.surface(),  # orth
                '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']),  # tag
                ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']),  # inf
                token.dictionary_form(),  # lemma
                token.reading_form(),  # user_data['reading_forms']
                sub_tokens_list[idx] if sub_tokens_list else None,  # user_data['sub_tokens']
            ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
            # remove empty tokens, which can be produced by characters like …
        ]
        # Sudachi normalizes internally and outputs each space char as a token.
        # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens
        return [
            t for idx, t in enumerate(dtokens) if
            idx == 0 or
            not t.surface.isspace() or t.tag != '空白' or
            not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白'
        ]
    def _get_sub_tokens(self, sudachipy_tokens):
        if self.split_mode is None or self.split_mode == "A":  # do nothing for default split mode
            return None
        sub_tokens_list = []  # list of (list of list of DetailedToken | None)
        for token in sudachipy_tokens:
            sub_a = token.split(self.tokenizer.SplitMode.A)
            if len(sub_a) == 1:  # no sub tokens
                sub_tokens_list.append(None)
            elif self.split_mode == "B":
                sub_tokens_list.append([self._get_dtokens(sub_a, False)])
            else:  # "C"
                sub_b = token.split(self.tokenizer.SplitMode.B)
                if len(sub_a) == len(sub_b):
                    dtokens = self._get_dtokens(sub_a, False)
                    sub_tokens_list.append([dtokens, dtokens])
                else:
                    sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)])
        return sub_tokens_list
    def _get_config(self):
        config = OrderedDict(
            (
--- a/spacy/lang/ja/bunsetu.py
+++ b/spacy/lang/ja/bunsetu.py
@@ -1,144 +0,0 @@
 # coding: utf8
 from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 POS_PHRASE_MAP = {
    "NOUN": "NP",
    "NUM": "NP",
    "PRON": "NP",
    "PROPN": "NP",
    "VERB": "VP",
    "ADJ": "ADJP",
    "ADV": "ADVP",
    "CCONJ": "CCONJP",
 }
 # return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
 def yield_bunsetu(doc, debug=False):
    bunsetu = []
    bunsetu_may_end = False
    phrase_type = None
    phrase = None
    prev = None
    prev_tag = None
    prev_dep = None
    prev_head = None
    for t in doc:
        pos = t.pos_
        pos_type = POS_PHRASE_MAP.get(pos, None)
        tag = t.tag_
        dep = t.dep_
        head = t.head.i
        if debug:
            print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu)
        # DET is always an individual bunsetu
        if pos == "DET":
            if bunsetu:
                yield bunsetu, phrase_type, phrase
            yield [t], None, None
            bunsetu = []
            bunsetu_may_end = False
            phrase_type = None
            phrase = None
        # PRON or Open PUNCT always splits bunsetu
        elif tag == "補助記号-括弧開":
            if bunsetu:
                yield bunsetu, phrase_type, phrase
            bunsetu = [t]
            bunsetu_may_end = True
            phrase_type = None
            phrase = None
        # bunsetu head not appeared
        elif phrase_type is None:
            if bunsetu and prev_tag == "補助記号-読点":
                yield bunsetu, phrase_type, phrase
                bunsetu = []
                bunsetu_may_end = False
                phrase_type = None
                phrase = None
            bunsetu.append(t)
            if pos_type:  # begin phrase
                phrase = [t]
                phrase_type = pos_type
                if pos_type in {"ADVP", "CCONJP"}:
                    bunsetu_may_end = True
        # entering new bunsetu
        elif pos_type and (
            pos_type != phrase_type or  # different phrase type arises
            bunsetu_may_end  # same phrase type but bunsetu already ended
        ):
            # exceptional case: NOUN to VERB
            if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i:
                bunsetu.append(t)
                phrase_type = "VP"
                phrase.append(t)
            # exceptional case: VERB to NOUN
            elif phrase_type == "VP" and pos_type == "NP" and (
                    prev_dep == 'compound' and prev_head == t.i or
                    dep == 'compound' and prev == head or
                    prev_dep == 'nmod' and prev_head == t.i
            ):
                bunsetu.append(t)
                phrase_type = "NP"
                phrase.append(t)
            else:
                yield bunsetu, phrase_type, phrase
                bunsetu = [t]
                bunsetu_may_end = False
                phrase_type = pos_type
                phrase = [t]
        # NOUN bunsetu
        elif phrase_type == "NP":
            bunsetu.append(t)
            if not bunsetu_may_end and ((
                (pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'}
            ) or (
                pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
            )):
                phrase.append(t)
            else:
                bunsetu_may_end = True
        # VERB bunsetu
        elif phrase_type == "VP":
            bunsetu.append(t)
            if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound':
                phrase.append(t)
            else:
                bunsetu_may_end = True
        # ADJ bunsetu
        elif phrase_type == "ADJP" and tag != '連体詞':
            bunsetu.append(t)
            if not bunsetu_may_end and ((
                pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'}
            ) or (
                pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
            )):
                phrase.append(t)
            else:
                bunsetu_may_end = True
        # other bunsetu
        else:
            bunsetu.append(t)
        prev = t.i
        prev_tag = t.tag_
        prev_dep = t.dep_
        prev_head = head
    if bunsetu:
        yield bunsetu, phrase_type, phrase
--- a/spacy/lang/ne/__init__.py
+++ b/spacy/lang/ne/__init__.py
@@ -0,0 +1,23 @@
 # coding: utf8
 from __future__ import unicode_literals
 from .stop_words import STOP_WORDS
 from .lex_attrs import LEX_ATTRS
 from ...language import Language
 from ...attrs import LANG
 class NepaliDefaults(Language.Defaults):
    lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
    lex_attr_getters.update(LEX_ATTRS)
    lex_attr_getters[LANG] = lambda text: "ne"  # Nepali language ISO code
    stop_words = STOP_WORDS
 class Nepali(Language):
    lang = "ne"
    Defaults = NepaliDefaults
 __all__ = ["Nepali"]
--- a/spacy/lang/ne/examples.py
+++ b/spacy/lang/ne/examples.py
@@ -0,0 +1,22 @@
 # coding: utf8
 from __future__ import unicode_literals
 """
 Example sentences to test spaCy and its language models.
 >>> from spacy.lang.ne.examples import sentences
 >>> docs = nlp.pipe(sentences)
 """
 sentences = [
    "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
    "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
    "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
    "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
    "तिमी कहाँ छौ?",
    "फ्रान्स को राष्ट्रपति को हो?",
    "संयुक्त राज्यको राजधानी के हो?",
    "बराक ओबामा कहिले कहिले जन्मेका हुन्?",
 ]
--- a/spacy/lang/ne/lex_attrs.py
+++ b/spacy/lang/ne/lex_attrs.py
@@ -0,0 +1,98 @@
 # coding: utf8
 from __future__ import unicode_literals
 from ..norm_exceptions import BASE_NORMS
 from ...attrs import NORM, LIKE_NUM
 # fmt: off
 _stem_suffixes = [
    ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
    ["ँ", "ं", "्", "ः"],
    ["लाई", "ले", "बाट", "को", "मा", "हरू"],
    ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
    ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
    ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
    ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
    ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
    ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
    ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
    ["याइ", "ाइ", "बार", "वार", "चाँहि"],
    ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
 ]
 # fmt: on
 # reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
 # reference 2: https://www.imnepal.com/nepali-numbers/
 _num_words = [
    "शुन्य",
    "एक",
    "दुई",
    "तीन",
    "चार",
    "पाँच",
    "छ",
    "सात",
    "आठ",
    "नौ",
    "दश",
    "एघार",
    "बाह्र",
    "तेह्र",
    "चौध",
    "पन्ध्र",
    "सोह्र",
    "सोह्र",
    "सत्र",
    "अठार",
    "उन्नाइस",
    "बीस",
    "तीस",
    "चालीस",
    "पचास",
    "साठी",
    "सत्तरी",
    "असी",
    "नब्बे",
    "सय",
    "हजार",
    "लाख",
    "करोड",
    "अर्ब",
    "खर्ब",
 ]
 def norm(string):
    # normalise base exceptions, e.g. punctuation or currency symbols
    if string in BASE_NORMS:
        return BASE_NORMS[string]
    # set stem word as norm, if available, adapted from:
    # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
    # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
    for suffix_group in reversed(_stem_suffixes):
        length = len(suffix_group[0])
        if len(string) <= length:
            break
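        for suffix in suffix_group:
            # strip the matched suffix itself: suffixes within a group differ in length
            if string.endswith(suffix) and len(string) > len(suffix):
                return string[: -len(suffix)]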
    return string
 def like_num(text):
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    if text.lower() in _num_words:
        return True
    return False
 LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
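
A small standalone sketch of how the `like_num` check above behaves, runnable without spaCy. `NUM_WORDS` is a hypothetical five-word subset of `_num_words`, and the sketch assumes thousands separators are written as a bare comma (as in `1,000`), so that is the string it strips.

```python
# Hypothetical subset of the _num_words list, for illustration only.
NUM_WORDS = {"एक", "दुई", "तीन", "सय", "हजार"}


def like_num(text):
    """Return True if text looks like a number (digits, a fraction, or a number word)."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Assumption: separators are a bare "," and "."; both are dropped before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():  # str.isdigit() also accepts Devanagari digits such as "१०"
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text in NUM_WORDS


for candidate in ["1,000", "१०", "1/2", "हजार", "किताब"]:
    print(candidate, like_num(candidate))
```

Because `str.isdigit()` is Unicode-aware, numerals written in Devanagari need no entries in the word list; the list only has to cover spelled-out numbers.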
--- a/spacy/lang/ne/stop_words.py
+++ b/spacy/lang/ne/stop_words.py
@@ -0,0 +1,498 @@
 # coding: utf8
 from __future__ import unicode_literals
 # Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
 STOP_WORDS = set(
    """
 अक्सर
 अगाडि
 अगाडी
 अघि
 अझै
 अठार
 अथवा
 अनि
 अनुसार
 अन्तर्गत
 अन्य
 अन्यत्र
 अन्यथा
 अब
 अरु
 अरुलाई
 अरू
 अर्को
 अर्थात
 अर्थात्
 अलग
 अलि
 अवस्था
 अहिले
 आए
 आएका
 आएको
 आज
 आजको
 आठ
 आत्म
 आदि
 आदिलाई
 आफनो
 आफू
 आफूलाई
 आफै
 आफैँ
 आफ्नै
 आफ्नो
 आयो
 उ
 उक्त
 उदाहरण
 उनको
 उनलाई
 उनले
 उनि
 उनी
 उनीहरुको
 उन्नाइस
 उप
 उसको
 उसलाई
 उसले
 उहालाई
 ऊ
 एउटा
 एउटै
 एक
 एकदम
 एघार
 ओठ
 औ
 औं
 कता
 कति
 कतै
 कम
 कमसेकम
 कसरि
 कसरी
 कसै
 कसैको
 कसैलाई
 कसैले
 कसैसँग
 कस्तो
 कहाँबाट
 कहिलेकाहीं
 का
 काम
 कारण
 कि
 किन
 किनभने
 कुन
 कुनै
 कुन्नी
 कुरा
 कृपया
 के
 केहि
 केही
 को
 कोहि
 कोहिपनि
 कोही
 कोहीपनि
 क्रमशः
 गए
 गएको
 गएर
 गयौ
 गरि
 गरी
 गरे
 गरेका
 गरेको
 गरेर
 गरौं
 गर्छ
 गर्छन्
 गर्छु
 गर्दा
 गर्दै
 गर्न
 गर्नु
 गर्नुपर्छ
 गर्ने
 गैर
 घर
 चार
 चाले
 चाहनुहुन्छ
 चाहन्छु
 चाहिं
 चाहिए
 चाहिंले
 चाहीं
 चाहेको
 चाहेर
 चोटी
 चौथो
 चौध
 छ
 छन
 छन्
 छु
 छू
 छैन
 छैनन्
 छौ
 छौं
 जता
 जताततै
 जना
 जनाको
 जनालाई
 जनाले
 जब
 जबकि
 जबकी
 जसको
 जसबाट
 जसमा
 जसरी
 जसलाई
 जसले
 जस्ता
 जस्तै
 जस्तो
 जस्तोसुकै
 जहाँ
 जान
 जाने
 जाहिर
 जुन
 जुनै
 जे
 जो
 जोपनि
 जोपनी
 झैं
 ठाउँमा
 ठीक
 ठूलो
 त
 तता
 तत्काल
 तथा
 तथापि
 तथापी
 तदनुसार
 तपाइ
 तपाई
 तपाईको
 तब
 तर
 तर्फ
 तल
 तसरी
 तापनि
 तापनी
 तिन
 तिनि
 तिनिहरुलाई
 तिनी
 तिनीहरु
 तिनीहरुको
 तिनीहरू
 तिनीहरूको
 तिनै
 तिमी
 तिर
 तिरको
 ती
 तीन
 तुरन्त
 तुरुन्त
 तुरुन्तै
 तेश्रो
 तेस्कारण
 तेस्रो
 तेह्र
 तैपनि
 तैपनी
 त्यत्तिकै
 त्यत्तिकैमा
 त्यस
 त्यसकारण
 त्यसको
 त्यसले
 त्यसैले
 त्यसो
 त्यस्तै
 त्यस्तो
 त्यहाँ
 त्यहिँ
 त्यही
 त्यहीँ
 त्यहीं
 त्यो
 त्सपछि
 त्सैले
 थप
 थरि
 थरी
 थाहा
 थिए
 थिएँ
 थिएन
 थियो
 दर्ता
 दश
 दिए
 दिएको
 दिन
 दिनुभएको
 दिनुहुन्छ
 दुइ
 दुइवटा
 दुई
 देखि
 देखिन्छ
 देखियो
 देखे
 देखेको
 देखेर
 दोश्री
 दोश्रो
 दोस्रो
 द्वारा
 धन्न
 धेरै
 धौ
 न
 नगर्नु
 नगर्नू
 नजिकै
 नत्र
 नत्रभने
 नभई
 नभएको
 नभनेर
 नयाँ
 नि
 निकै
 निम्ति
 निम्न
 निम्नानुसार
 निर्दिष्ट
 नै
 नौ
 पक्का
 पक्कै
 पछाडि
 पछाडी
 पछि
 पछिल्लो
 पछी
 पटक
 पनि
 पन्ध्र
 पर्छ
 पर्थ्यो
 पर्दैन
 पर्ने
 पर्नेमा
 पर्याप्त
 पहिले
 पहिलो
 पहिल्यै
 पाँच
 पांच
 पाचौँ
 पाँचौं
 पिच्छे
 पूर्व
 पो
 प्रति
 प्रतेक
 प्रत्यक
 प्राय
 प्लस
 फरक
 फेरि
 फेरी
 बढी
 बताए
 बने
 बरु
 बाट
 बारे
 बाहिर
 बाहेक
 बाह्र
 बिच
 बिचमा
 बिरुद्ध
 बिशेष
 बिस
 बीच
 बीचमा
 बीस
 भए
 भएँ
 भएका
 भएकालाई
 भएको
 भएन
 भएर
 भन
 भने
 भनेको
 भनेर
 भन्
 भन्छन्
 भन्छु
 भन्दा
 भन्दै
 भन्नुभयो
 भन्ने
 भन्या
 भयेन
 भयो
 भर
 भरि
 भरी
 भा
 भित्र
 भित्री
 भीत्र
 म
 मध्य
 मध्ये
 मलाई
 मा
 मात्र
 मात्रै
 माथि
 माथी
 मुख्य
 मुनि
 मुन्तिर
 मेरो
 मैले
 यति
 यथोचित
 यदि
 यद्ध्यपि
 यद्यपि
 यस
 यसका
 यसको
 यसपछि
 यसबाहेक
 यसमा
 यसरी
 यसले
 यसो
 यस्तै
 यस्तो
 यहाँ
 यहाँसम्म
 यही
 या
 यी
 यो
 र
 रही
 रहेका
 रहेको
 रहेछ
 राखे
 राख्छ
 राम्रो
 रुपमा
 रूप
 रे
 लगभग
 लगायत
 लाई
 लाख
 लागि
 लागेको
 ले
 वटा
 वरीपरी
 वा
 वाट
 वापत
 वास्तवमा
 शायद
 सक्छ
 सक्ने
 सँग
 संग
 सँगको
 सँगसँगै
 सँगै
 संगै
 सङ्ग
 सङ्गको
 सट्टा
 सत्र
 सधै
 सबै
 सबैको
 सबैलाई
 समय
 समेत
 सम्भव
 सम्म
 सय
 सरह
 सहित
 सहितै
 सही
 साँच्चै
 सात
 साथ
 साथै
 सायद
 सारा
 सुनेको
 सुनेर
 सुरु
 सुरुको
 सुरुमै
 सो
 सोचेको
 सोचेर
 सोही
 सोह्र
 स्थित
 स्पष्ट
 हजार
 हरे
 हरेक
 हामी
 हामीले
 हाम्रा
 हाम्रो
 हुँदैन
 हुन
 हुनत
 हुनु
 हुने
 हुनेछ
 हुन्
 हुन्छ
 हुन्थ्यो
 हैन
 हो
 होइन
 होकि
 होला
 """.split()
 )
--- a/spacy/lexeme.pyx
+++ b/spacy/lexeme.pyx
@@ -349,7 +349,7 @@ cdef class Lexeme:
    @property
    def is_oov(self):
        """RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
-        return self.orth in self.vocab.vectors
+        return self.orth not in self.vocab.vectors
    property is_stop:
        """RETURNS (bool): Whether the lexeme is a stop word."""
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@@ -528,10 +528,10 @@ class Tagger(Pipe):
                        new_tag_map[tag] = orig_tag_map[tag]
                    else:
                        new_tag_map[tag] = {POS: X}
-        if "_SP" in orig_tag_map:
-            new_tag_map["_SP"] = orig_tag_map["_SP"]
        cdef Vocab vocab = self.vocab
        if new_tag_map:
+            if "_SP" in orig_tag_map:
+                new_tag_map["_SP"] = orig_tag_map["_SP"]
            vocab.morphology = Morphology(vocab.strings, new_tag_map,
                                          vocab.morphology.lemmatizer,
                                          exc=vocab.morphology.exc)
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -170,6 +170,11 @@ def nb_tokenizer():
    return get_lang_class("nb").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
 def ne_tokenizer():
    return get_lang_class("ne").Defaults.create_tokenizer()
@pytest.fixture(scope="session")
 def nl_tokenizer():
    return get_lang_class("nl").Defaults.create_tokenizer()
--- a/spacy/tests/lang/ja/test_tokenizer.py
+++ b/spacy/tests/lang/ja/test_tokenizer.py
@@ -4,7 +4,7 @@ from __future__ import unicode_literals
 import pytest
 from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
-from spacy.lang.ja import Japanese
+from spacy.lang.ja import Japanese, DetailedToken
 # fmt: off
 TOKENIZER_TESTS = [
@@ -96,6 +96,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
    assert len(nlp_c(text)) == len_c
@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
    [
        (
            "選挙管理委員会",
            [None, None, None, None],
            [None, None, [
                [
                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
                ]
            ]],
            [[
                [
                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
                    DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
                    DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
                ], [
                    DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
                    DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
                    DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
                ]
            ]]
        ),
    ]
 )
 def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
    nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
    nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
    nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
    assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
    assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
    assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
    assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
@pytest.mark.parametrize("text,inflections,reading_forms",
    [
        (
            "取ってつけた",
            ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
            ("トッ", "テ", "ツケ", "タ"),
        ),
    ]
 )
 def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
    assert ja_tokenizer(text).user_data["inflections"] == inflections
    assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
 def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
    doc = ja_tokenizer("")
    assert len(doc) == 0
--- a/spacy/tests/lang/ne/__init__.py
+++ b/spacy/tests/lang/ne/__init__.py
--- a/spacy/tests/lang/ne/test_text.py
+++ b/spacy/tests/lang/ne/test_text.py
@ -0,0 +1,19 @@
 # coding: utf-8
 from __future__ import unicode_literals
 import pytest
 def test_ne_tokenizer_handles_long_text(ne_tokenizer):
    text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
    tokens = ne_tokenizer(text)
    assert len(tokens) == 24
@pytest.mark.parametrize(
    "text,length",
    [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
 )
 def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
    tokens = ne_tokenizer(text)
    assert len(tokens) == length
--- a/spacy/tests/pipeline/test_tagger.py
+++ b/spacy/tests/pipeline/test_tagger.py
@@ -3,6 +3,7 @@ from __future__ import unicode_literals
 import pytest
 from spacy.language import Language
 from spacy.symbols import POS, NOUN
 def test_label_types():
@@ -11,3 +12,16 @@ def test_label_types():
    nlp.get_pipe("tagger").add_label("A")
    with pytest.raises(ValueError):
        nlp.get_pipe("tagger").add_label(9)
 def test_tagger_begin_training_tag_map():
    """Test that Tagger.begin_training() without gold tuples does not clobber
    the tag map."""
    nlp = Language()
    tagger = nlp.create_pipe("tagger")
    orig_tag_count = len(tagger.labels)
    tagger.add_label("A", {"POS": "NOUN"})
    nlp.add_pipe(tagger)
    nlp.begin_training()
    assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
    assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
--- a/spacy/tests/vocab_vectors/test_vectors.py
+++ b/spacy/tests/vocab_vectors/test_vectors.py
@@ -376,6 +376,6 @@ def test_vector_is_oov():
    data[1] = 2.0
    vocab.set_vector("cat", data[0])
    vocab.set_vector("dog", data[1])
-    assert vocab["cat"].is_oov is True
+    assert vocab["cat"].is_oov is False
-    assert vocab["dog"].is_oov is True
+    assert vocab["dog"].is_oov is False
-    assert vocab["hamster"].is_oov is False
+    assert vocab["hamster"].is_oov is True
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@@ -923,7 +923,7 @@ cdef class Token:
    @property
    def is_oov(self):
        """RETURNS (bool): Whether the token is out-of-vocabulary."""
-        return self.c.lex.orth in self.vocab.vectors
+        return self.c.lex.orth not in self.vocab.vectors
    @property
    def is_stop(self):
--- a/spacy/util.py
+++ b/spacy/util.py
@@ -208,6 +208,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
        pipeline = nlp.Defaults.pipe_names
    elif pipeline in (False, None):
        pipeline = []
    # skip "vocab" from overrides in component initialization since vocab is
    # already configured from overrides when nlp is initialized above
    if "vocab" in overrides:
        del overrides["vocab"]
    for name in pipeline:
        if name not in disable:
            config = meta.get("pipeline_args", {}).get(name, {})
--- a/website/docs/api/goldparse.md
+++ b/website/docs/api/goldparse.md
@@ -13,7 +13,7 @@ of a label to have the value `0.0`. Labels not in the dictionary are treated as
 missing – the gradient for those labels will be zero.
 | Name              | Type        | Description                                                                                                                                                                                                                            |
-| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
 | `doc`             | `Doc`       | The document the annotations refer to.                                                                                                                                                                                                 |
 | `words`           | iterable    | A sequence of unicode word strings.                                                                                                                                                                                                    |
 | `tags`            | iterable    | A sequence of strings, representing tag annotations.                                                                                                                                                                                   |
@@ -22,7 +22,7 @@ missing – the gradient for those labels will be zero.
 | `entities`        | iterable    | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
 | `cats`            | dict        | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative).                                                                                  |
 | `links`           | dict        | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative).                       |
-| `make_projective` | bool  | Whether to projectivize the dependency tree. Defaults to `False.`.                                                                                     |
+| `make_projective` | bool        | Whether to projectivize the dependency tree. Defaults to `False`.                                                                                                                                                                      |
 | **RETURNS**       | `GoldParse` | The newly constructed object.                                                                                                                                                                                                          |
 ## GoldParse.\_\_len\_\_ {#len tag="method"}
@@ -44,7 +44,7 @@ Whether the provided syntactic annotations form a projective dependency tree.
 ## Attributes {#attributes}
 | Name                                 | Type | Description                                                                                                              |
-| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ |
 | `words`                              | list | The words.                                                                                                               |
 | `tags`                               | list | The part-of-speech tag annotations.                                                                                      |
 | `heads`                              | list | The syntactic head annotations.                                                                                          |
@@ -61,7 +61,8 @@ Whether the provided syntactic annotations form a projective dependency tree.
 Convert a list of Doc objects into the
 [JSON-serializable format](/api/annotation#json-input) used by the
-[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
+[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
 'paragraph' in the output doc.
 > #### Example
 >
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects.
 | Name        | Type         | Description                                                                                                                                                              |
 | ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `doclike`   | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3)..                                                                                                                    |
+| `doclike`   | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).                                                                                                                     |
 | **RETURNS** | list         | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
 <Infobox title="Important note" variant="warning">
--- a/website/docs/usage/101/_pos-deps.md
+++ b/website/docs/usage/101/_pos-deps.md
@@ -36,7 +36,7 @@ for token in doc:
 | Text    | Lemma   | POS     | Tag   | Dep        | Shape   | alpha   | stop    |
 | ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- |
 | Apple   | apple   | `PROPN` | `NNP` | `nsubj`    | `Xxxxx` | `True`  | `False` |
-| is      | be      | `VERB`  | `VBZ` | `aux`      | `xx`    | `True`  | `True`  |
+| is      | be      | `AUX`   | `VBZ` | `aux`      | `xx`    | `True`  | `True`  |
 | looking | look    | `VERB`  | `VBG` | `ROOT`     | `xxxx`  | `True`  | `False` |
 | at      | at      | `ADP`   | `IN`  | `prep`     | `xx`    | `True`  | `True`  |
 | buying  | buy     | `VERB`  | `VBG` | `pcomp`    | `xxxx`  | `True`  | `False` |
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole
 documents**, not just single sentences. If your corpus only contains single
 sentences, spaCy's models will never learn to expect multi-sentence documents,
 leading to low performance on real text. To mitigate this problem, you can use
-the `-N` argument to the `spacy convert` command, to merge some of the sentences
+the `-n` argument to the `spacy convert` command, to merge some of the sentences
 into longer pseudo-documents.
 ### Training the tagger and parser {#train-tagger-parser}
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.")
 print("Before", doc.ents)  # []
 header = [ENT_IOB, ENT_TYPE]
-attr_array = numpy.zeros((len(doc), len(header)))
+attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
 attr_array[0, 0] = 3  # B
 attr_array[0, 1] = doc.vocab.strings["GPE"]
 doc.from_array(header, attr_array)
@ -1143,9 +1143,9 @@ from spacy.gold import align
 other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
 spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
 cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
-print("Misaligned tokens:", cost)  # 2
+print("Edit distance:", cost)  # 3
 print("One-to-one mappings a -> b", a2b)  # array([0, 1, 2, 3, -1, -1, 5, 6])
-print("One-to-one mappings b -> a", b2a)  # array([0, 1, 2, 3, 5, 6, 7])
+print("One-to-one mappings b -> a", b2a)  # array([0, 1, 2, 3, -1, 6, 7])
 print("Many-to-one mappings a -> b", a2b_multi)  # {4: 4, 5: 4}
 print("Many-to-one mappings b-> a", b2a_multi)  # {}
 ```
@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi)  # {}
 Here are some insights from the alignment information generated in the example
 above:
- Two tokens are misaligned.
+- The edit distance (cost) is `3`: two deletions and one insertion.
 - The one-to-one mappings for the first four tokens are identical, which means
  they map to each other. This makes sense because they're also identical in the
  input: `"i"`, `"listened"`, `"to"` and `"obama"`.
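
As a rough, self-contained illustration of where a cost of `3` comes from, the sketch below counts insertions and deletions between the two token lists using a longest-common-subsequence table. It is not spaCy's `align` implementation: the helper name `indel_distance` is made up for this example, and it assumes the cost counts only insertions and deletions, as the first bullet describes.

```python
def indel_distance(a, b):
    """Number of insertions plus deletions needed to turn list a into list b."""
    # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    lcs = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    common = lcs[len(a)][len(b)]
    # Tokens of a outside the LCS are deleted; tokens of b outside it are inserted
    return (len(a) - common) + (len(b) - common)


other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
print(indel_distance(other_tokens, spacy_tokens))  # 3
```

Here `"'"` and `"s"` are deleted and `"'s"` is inserted, while the other six tokens are shared by both lists.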
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
   better segmentation for Chinese OntoNotes and the new
   [Chinese models](/models/zh).
 <Infobox variant="warning">
 Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
 with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
 install it from our fork and compile it locally:
 ```bash
 $ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
 ```
 </Infobox>
 <Accordion title="Details on spaCy's PKUSeg API">
 The `meta` argument of the `Chinese` language class supports the following
@ -196,12 +208,20 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo
 The Japanese language class uses
 [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. The default Japanese language class
+segmentation and part-of-speech tagging. The default Japanese language class and
-and the provided Japanese models use SudachiPy split mode `A`.
+the provided Japanese models use SudachiPy split mode `A`.
 The `meta` argument of the `Japanese` language class can be used to configure
 the split mode to `A`, `B` or `C`.
 <Infobox variant="warning">
 If you run into errors related to `sudachipy`, which is currently under active
 development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
 used for training the current [Japanese models](/models/ja).
 </Infobox>
 ## Installing and using models {#download}
 > #### Downloading models in spaCy < v1.7
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@ -1158,17 +1158,17 @@ what you need for your application.
 > available corpus.
 For example, the corpus spaCy's [English models](/models/en) were trained on
-defines a `PERSON` entity as just the **person name**, without titles like "Mr"
+defines a `PERSON` entity as just the **person name**, without titles like "Mr."
-or "Dr". This makes sense, because it makes it easier to resolve the entity type
+or "Dr.". This makes sense, because it makes it easier to resolve the entity
-back to a knowledge base. But what if your application needs the full names,
+type back to a knowledge base. But what if your application needs the full
-_including_ the titles?
+names, _including_ the titles?
 ```python
 ### {executable="true"}
 import spacy
 nlp = spacy.load("en_core_web_sm")
-doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
 print([(ent.text, ent.label_) for ent in doc.ents])
 ```
@ -1233,7 +1233,7 @@ def expand_person_entities(doc):
 # Add the component after the named entity recognizer
 nlp.add_pipe(expand_person_entities, after='ner')
-doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
 print([(ent.text, ent.label_) for ent in doc.ents])
 ```
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
 vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
 and Romanian** and updated the training data and vectors for most languages.
 Model packages with vectors are about **2&times;** smaller on disk and load
-**2-4&times;** faster. For the full changelog, see the [release notes on
+**2-4&times;** faster. For the full changelog, see the
-GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
+[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
-details and a behind-the-scenes look at the new release, [see our blog
+For more details and a behind-the-scenes look at the new release,
-post](https://explosion.ai/blog/spacy-v2-3).
+[see our blog post](https://explosion.ai/blog/spacy-v2-3).
 ### Expanded model families with vectors {#models}
@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).
 With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
 `md` and `lg` models with word vectors for all languages, this release provides
-a total of 46 model packages. For models trained using [Universal
+a total of 46 model packages. For models trained using
-Dependencies](https://universaldependencies.org) corpora, the training data has
+[Universal Dependencies](https://universaldependencies.org) corpora, the
-been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
+training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
-extended to include both UD Dutch Alpino and LassySmall.
+and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
 <Infobox>
@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
 ### Chinese {#chinese}
 > #### Example
 >
 > ```python
 > from spacy.lang.zh import Chinese
 >
@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
 >
 > # Append words to user dict
 > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
 > ```
 This release adds support for
-[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
+[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
-the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
+the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
-The Chinese tokenizer can be initialized with both `pkuseg` and custom models
+Chinese tokenizer can be initialized with both `pkuseg` and custom models and
-and the `pkuseg` user dictionary is easy to customize.
+the `pkuseg` user dictionary is easy to customize. Note that
 [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
 pre-compiled wheels for Python 3.8. See the
 [usage documentation](/usage/models#chinese) for details on how to install it on
 Python 3.8.
 <Infobox>
-**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)
+**Models:** [Chinese models](/models/zh) **Usage:**
 [Chinese tokenizer usage](/usage/models#chinese)
 </Infobox>
 ### Japanese {#japanese}
 The updated Japanese language class switches to
-[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
+[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
+segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
 installing spaCy for Japanese, which is now possible with a single command:
 `pip install spacy[ja]`.
 <Infobox>
-**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)
+**Models:** [Japanese models](/models/ja) **Usage:**
 [Japanese tokenizer usage](/usage/models#japanese)
 </Infobox>
 ### Small CLI updates
- `spacy debug-data` provides the coverage of the vectors in a base model with
+- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
-  `spacy debug-data lang train dev -b base_model`
+  in a base model with `spacy debug-data lang train dev -b base_model`
- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
+- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
-  dev.json`) to evaluate the tokenization accuracy without loading a model
+  `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
- `spacy train` on GPU restricts the CPU timing evaluation to the first
+  without loading a model
-  iteration
+- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
  the first iteration
 ## Backwards incompatibilities {#incompat}
@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
 If you've been training **your own models**, you'll need to **retrain** them
 with the new version. Also don't forget to upgrade all models to the latest
 versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
-with models for v2.3. To check if all of your models are up to date, you can
+with models for v2.3. To check if all of your models are up to date, you can run
-run the [`spacy validate`](/api/cli#validate) command.
+the [`spacy validate`](/api/cli#validate) command.
 </Infobox>
@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
 > directly.
 - If you're training new models, you'll want to install the package
-  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
+  [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
-  which now includes both the lemmatization tables (as in v2.2) and the
+  now includes both the lemmatization tables (as in v2.2) and the normalization
-  normalization tables (new in v2.3). If you're using pretrained models,
+  tables (new in v2.3). If you're using pretrained models, **nothing changes**,
-  **nothing changes**, because the relevant tables are included in the model
+  because the relevant tables are included in the model packages.
  packages.
 - Due to the updated Universal Dependencies training data, the fine-grained
  part-of-speech tags will change for many provided language models. The
  coarse-grained part-of-speech tagset remains the same, but the mapping from
  particular fine-grained to coarse-grained tags may show minor differences.
 - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
-  tagsets contain new merged tags related to contracted forms, such as
+  tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
-  `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
+  for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
-  `"à"`. This increases the accuracy of the models by improving the alignment
+  increases the accuracy of the models by improving the alignment between
-  between spaCy's tokenization and Universal Dependencies multi-word tokens
+  spaCy's tokenization and Universal Dependencies multi-word tokens used for
-  used for contractions.
+  contractions.
 ### Migrating from spaCy 2.2 {#migrating}
@ -143,29 +151,81 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
 and earlier versions.
 A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
-cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
+cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
-a comma at the end of a URL) before applying the match. See the full [tokenizer
+comma at the end of a URL) before applying the match. See the full
-documentation](/usage/linguistic-features#tokenization) and try out
+[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
 [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
 debugging your tokenizer configuration.
 #### Warnings configuration
-spaCy's custom warnings have been replaced with native python
+spaCy's custom warnings have been replaced with native Python
 [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
-setting `SPACY_WARNING_IGNORE`, use the [warnings
+setting `SPACY_WARNING_IGNORE`, use the [`warnings`
 filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
 to manage warnings.
 ```diff
 import spacy
 + import warnings
 - spacy.errors.SPACY_WARNING_IGNORE.append('W007')
 + warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning)
 ```
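
The filter can be tried out without spaCy installed. The following is a minimal sketch using only the standard library: the warning texts are placeholders rather than spaCy's real messages, and `surviving_warnings` is a hypothetical helper written for this illustration.

```python
import warnings


def surviving_warnings(messages, ignore_pattern):
    """Emit each message as a UserWarning and return the ones not filtered out."""
    with warnings.catch_warnings(record=True) as caught:
        # Record everything by default ...
        warnings.simplefilter("always")
        # ... except messages whose start matches the pattern. Filters added
        # later are checked first, so this "ignore" rule takes precedence.
        warnings.filterwarnings("ignore", message=ignore_pattern, category=UserWarning)
        for msg in messages:
            warnings.warn(msg, UserWarning)
    return [str(w.message) for w in caught]


# Placeholder messages that only mimic the "[W007] ..." code prefix
simulated = ["[W007] placeholder warning text", "[W008] another placeholder"]
print(surviving_warnings(simulated, r"\[W007\]"))
```

The `message` argument is a regular expression matched against the start of the warning text, which is why the square brackets around the code need to be escaped.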
 #### Normalization tables
 The normalization tables have moved from the language data in
-[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
+[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
-the package
+package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
-[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
+If you're adding data for a new language, the normalization table should be
-you're adding data for a new language, the normalization table should be added
+added to `spacy-lookups-data`. See
-to `spacy-lookups-data`. See [adding norm
+[adding norm exceptions](/usage/adding-languages#norm-exceptions).
-exceptions](/usage/adding-languages#norm-exceptions).
+
 #### No preloaded lexemes/vocab for models with vectors
 To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
 loaded on initialization for models with vectors. As you process texts, the
 lexemes will be added to the vocab automatically, just as in models without
 vectors.
 To see the number of unique vectors and number of words with vectors, see
 `nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
 and `684830` words with vectors:
 ```python
 {
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
 }
 ```
 If required, for instance if you are working directly with word vectors rather
 than processing texts, you can load all lexemes for words with vectors at once:
 ```python
 for orth in nlp.vocab.vectors:
    _ = nlp.vocab[orth]
 ```
 #### Lexeme.is_oov and Token.is_oov
 <Infobox title="Important note" variant="warning">
 Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
 fixed in the next patch release v2.3.1.
 </Infobox>
 In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
 have a word vector. This is equivalent to `token.orth not in
 nlp.vocab.vectors`.
 Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
 probability and cluster features. The probability and cluster features are no
 longer included in the provided medium and large models (see the next section).
 #### Probability and cluster features
@ -181,28 +241,28 @@ exceptions](/usage/adding-languages#norm-exceptions).
 The `Token.prob` and `Token.cluster` features, which are no longer used by the
 core pipeline components as of spaCy v2, are no longer provided in the
-pretrained models to reduce the model size. To keep these features available
+pretrained models to reduce the model size. To keep these features available for
-for users relying on them, the `prob` and `cluster` features for the most
+users relying on them, the `prob` and `cluster` features for the most frequent
-frequent 1M tokens have been moved to
+1M tokens have been moved to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
 `extra` features for the relevant languages (English, German, Greek and
 Spanish).
 The extra tables are loaded lazily, so if you have `spacy-lookups-data`
-installed and your code accesses `Token.prob`, the full table is loaded into
+installed and your code accesses `Token.prob`, the full table is loaded into the
-the model vocab, which will take a few seconds on initial loading. When you
+model vocab, which will take a few seconds on initial loading. When you save
-save this model after loading the `prob` table, the full `prob` table will be
+this model after loading the `prob` table, the full `prob` table will be saved
-saved as part of the model vocab.
+as part of the model vocab.
-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
+If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
-part of a new model, add the data to
+of a new model, add the data to
 [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
 the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
 initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
 [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
 `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
-currently only used to provide a custom `oov_prob`. See examples in the [`data`
+currently only used to provide a custom `oov_prob`. See examples in the
-directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
+[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
 in `spacy-lookups-data`.
 #### Initializing new models without extra lookups tables
--- a/website/meta/site.json
+++ b/website/meta/site.json
@ -23,9 +23,9 @@
        "apiKey": "371e26ed49d29a27bd36273dfdaf89af",
        "indexName": "spacy"
    },
-    "binderUrl": "ines/spacy-io-binder",
+    "binderUrl": "explosion/spacy-io-binder",
    "binderBranch": "live",
-    "binderVersion": "2.2.0",
+    "binderVersion": "2.3.0",
    "sections": [
        { "id": "usage", "title": "Usage Documentation", "theme": "blue" },
        { "id": "models", "title": "Models Documentation", "theme": "blue" },