diff --git a/.github/contributors/mahnerak.md b/.github/contributors/mahnerak.md deleted file mode 100644 index cc7739681..000000000 --- a/.github/contributors/mahnerak.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. 
Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Karen Hambardzumyan | -| Company name (if applicable) | YerevaNN | -| Title or role (if applicable) | Researcher | -| Date | 2020-06-19 | -| GitHub username | mahnerak | -| Website (optional) | https://mahnerak.com/| diff --git a/.github/contributors/myavrum.md b/.github/contributors/myavrum.md deleted file mode 100644 index dc8f1bb84..000000000 --- a/.github/contributors/myavrum.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. 
With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. 
- -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Marat M. Yavrumyan | -| Company name (if applicable) | YSU, UD_Armenian Project | -| Title or role (if applicable) | Dr., Principal Investigator | -| Date | 2020-06-19 | -| GitHub username | myavrum | -| Website (optional) | http://armtreebank.yerevann.com/ | diff --git a/.github/contributors/rameshhpathak.md b/.github/contributors/rameshhpathak.md deleted file mode 100644 index 30a543307..000000000 --- a/.github/contributors/rameshhpathak.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. 
With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. - -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Ramesh Pathak | -| Company name (if applicable) | Diyo AI | -| Title or role (if applicable) | AI Engineer | -| Date | June 21, 2020 | -| GitHub username | rameshhpathak | -| Website (optional) |rameshhpathak.github.io| | diff --git a/.github/contributors/richardliaw.md b/.github/contributors/richardliaw.md deleted file mode 100644 index 2af4ce840..000000000 --- a/.github/contributors/richardliaw.md +++ /dev/null @@ -1,106 +0,0 @@ -# spaCy contributor agreement - -This spaCy Contributor Agreement (**"SCA"**) is based on the -[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf). -The SCA applies to any contribution that you make to any product or project -managed by us (the **"project"**), and sets out the intellectual property rights -you grant to us in the contributed materials. The term **"us"** shall mean -[ExplosionAI GmbH](https://explosion.ai/legal). The term -**"you"** shall mean the person or entity identified below. - -If you agree to be bound by these terms, fill in the information requested -below and include the filled-in version with your first pull request, under the -folder [`.github/contributors/`](/.github/contributors/). 
The name of the file -should be your GitHub username, with the extension `.md`. For example, the user -example_user would create the file `.github/contributors/example_user.md`. - -Read this agreement carefully before signing. These terms and conditions -constitute a binding legal agreement. - -## Contributor Agreement - -1. The term "contribution" or "contributed materials" means any source code, -object code, patch, tool, sample, graphic, specification, manual, -documentation, or any other material posted or submitted by you to the project. - -2. With respect to any worldwide copyrights, or copyright applications and -registrations, in your contribution: - - * you hereby assign to us joint ownership, and to the extent that such - assignment is or becomes invalid, ineffective or unenforceable, you hereby - grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge, - royalty-free, unrestricted license to exercise all rights under those - copyrights. This includes, at our option, the right to sublicense these same - rights to third parties through multiple levels of sublicensees or other - licensing arrangements; - - * you agree that each of us can do all things in relation to your - contribution as if each of us were the sole owners, and if one of us makes - a derivative work of your contribution, the one who makes the derivative - work (or has it made will be the sole owner of that derivative work; - - * you agree that you will not assert any moral rights in your contribution - against us, our licensees or transferees; - - * you agree that we may register a copyright in your contribution and - exercise all ownership rights associated with it; and - - * you agree that neither of us has any duty to consult with, obtain the - consent of, pay or render an accounting to the other for any use or - distribution of your contribution. - -3. With respect to any patents you own, or that you can license without payment -to any third party, you hereby grant to us a perpetual, irrevocable, -non-exclusive, worldwide, no-charge, royalty-free license to: - - * make, have made, use, sell, offer to sell, import, and otherwise transfer - your contribution in whole or in part, alone or in combination with or - included in any product, work or materials arising out of the project to - which your contribution was submitted, and - - * at our option, to sublicense these same rights to third parties through - multiple levels of sublicensees or other licensing arrangements. - -4. Except as set out above, you keep all right, title, and interest in your -contribution. The rights that you grant to us under these terms are effective -on the date you first submitted a contribution to us, even if your submission -took place before the date you sign these terms. - -5. You covenant, represent, warrant and agree that: - - * Each contribution that you submit is and shall be an original work of - authorship and you can legally grant the rights set out in this SCA; - - * to the best of your knowledge, each contribution will not violate any - third party's copyrights, trademarks, patents, or other intellectual - property rights; and - - * each contribution shall be in compliance with U.S. export control laws and - other applicable export and import laws. You agree to notify us if you - become aware of any circumstance which would make any of the foregoing - representations inaccurate in any respect. We may publicly disclose your - participation in the project, including the fact that you have signed the SCA. 
- -6. This SCA is governed by the laws of the State of California and applicable -U.S. Federal law. Any choice of law rules will not apply. - -7. Please place an “x” on one of the applicable statement below. Please do NOT -mark both statements: - - * [x] I am signing on behalf of myself as an individual and no other person - or entity, including my employer, has or will have rights with respect to my - contributions. - - * [ ] I am signing on behalf of my employer or a legal entity and I have the - actual authority to contractually bind that entity. - -## Contributor Details - -| Field | Entry | -|------------------------------- | -------------------- | -| Name | Richard Liaw | -| Company name (if applicable) | | -| Title or role (if applicable) | | -| Date | 06/22/2020 | -| GitHub username | richardliaw | -| Website (optional) | | \ No newline at end of file diff --git a/spacy/lang/hy/examples.py b/spacy/lang/hy/examples.py index 8a00fd243..323f77b1c 100644 --- a/spacy/lang/hy/examples.py +++ b/spacy/lang/hy/examples.py @@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models. sentences = [ "Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։", "Ո՞վ է Ֆրանսիայի նախագահը։", - "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։", + "Որն է Միացյալ Նահանգների մայրաքաղաքը։", "Ե՞րբ է ծնվել Բարաք Օբաման։", ] diff --git a/spacy/lang/hy/lex_attrs.py b/spacy/lang/hy/lex_attrs.py index dea3c0e97..910625fb8 100644 --- a/spacy/lang/hy/lex_attrs.py +++ b/spacy/lang/hy/lex_attrs.py @@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM _num_words = [ - "զրո", - "մեկ", + "զրօ", + "մէկ", "երկու", "երեք", "չորս", @@ -18,21 +18,20 @@ _num_words = [ "տասը", "տասնմեկ", "տասներկու", - "տասներեք", - "տասնչորս", - "տասնհինգ", - "տասնվեց", - "տասնյոթ", - "տասնութ", - "տասնինը", - "քսան", - "երեսուն", + "տասն­երեք", + "տասն­չորս", + "տասն­հինգ", + "տասն­վեց", + "տասն­յոթ", + "տասն­ութ", + "տասն­ինը", + "քսան" "երեսուն", "քառասուն", "հիսուն", - "վաթսուն", + "վաթցսուն", "յոթանասուն", "ութսուն", - "իննսուն", + "ինիսուն", "հարյուր", "հազար", "միլիոն", diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py index fb8b9d7fe..a7ad0846e 100644 --- a/spacy/lang/ja/__init__.py +++ b/spacy/lang/ja/__init__.py @@ -20,7 +20,12 @@ from ... import util # Hold the attributes we need with convenient names -DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"]) +DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"]) + +# Handling for multiple spaces in a row is somewhat awkward, this simplifies +# the flow by creating a dummy with the same interface. +DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"]) +DummySpace = DummyNode(" ", " ", " ") def try_sudachi_import(split_mode="A"): @@ -48,7 +53,7 @@ def try_sudachi_import(split_mode="A"): ) -def resolve_pos(orth, tag, next_tag): +def resolve_pos(orth, pos, next_pos): """If necessary, add a field to the POS tag for UD mapping. Under Universal Dependencies, sometimes the same Unidic POS tag can be mapped differently depending on the literal token or its context @@ -59,77 +64,124 @@ def resolve_pos(orth, tag, next_tag): # Some tokens have their UD tag decided based on the POS of the following # token. 
- # apply orth based mapping - if tag in TAG_ORTH_MAP: - orth_map = TAG_ORTH_MAP[tag] + # orth based rules + if pos[0] in TAG_ORTH_MAP: + orth_map = TAG_ORTH_MAP[pos[0]] if orth in orth_map: - return orth_map[orth], None # current_pos, next_pos + return orth_map[orth], None - # apply tag bi-gram mapping - if next_tag: - tag_bigram = tag, next_tag + # tag bi-gram mapping + if next_pos: + tag_bigram = pos[0], next_pos[0] if tag_bigram in TAG_BIGRAM_MAP: - current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram] - if current_pos is None: # apply tag uni-gram mapping for current_pos - return TAG_MAP[tag][POS], next_pos # only next_pos is identified by tag bi-gram mapping + bipos = TAG_BIGRAM_MAP[tag_bigram] + if bipos[0] is None: + return TAG_MAP[pos[0]][POS], bipos[1] else: - return current_pos, next_pos + return bipos - # apply tag uni-gram mapping - return TAG_MAP[tag][POS], None + return TAG_MAP[pos[0]][POS], None -def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"): - # Compare the content of tokens and text, first +# Use a mapping of paired punctuation to avoid splitting quoted sentences. +pairpunct = {'「':'」', '『': '』', '【': '】'} + + +def separate_sentences(doc): + """Given a doc, mark tokens that start sentences based on Unidic tags. + """ + + stack = [] # save paired punctuation + + for i, token in enumerate(doc[:-2]): + # Set all tokens after the first to false by default. This is necessary + # for the doc code to be aware we've done sentencization, see + # `is_sentenced`. + token.sent_start = (i == 0) + if token.tag_: + if token.tag_ == "補助記号-括弧開": + ts = str(token) + if ts in pairpunct: + stack.append(pairpunct[ts]) + elif stack and ts == stack[-1]: + stack.pop() + + if token.tag_ == "補助記号-句点": + next_token = doc[i+1] + if next_token.tag_ != token.tag_ and not stack: + next_token.sent_start = True + + +def get_dtokens(tokenizer, text): + tokens = tokenizer.tokenize(text) + words = [] + for ti, token in enumerate(tokens): + tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']) + inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*']) + dtoken = DetailedToken( + token.surface(), + (tag, inf), + token.dictionary_form()) + if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白': + # don't add multiple space tokens in a row + continue + words.append(dtoken) + + # remove empty tokens. These can be produced with characters like … that + # Sudachi normalizes internally. 
+ words = [ww for ww in words if len(ww.surface) > 0] + return words + + +def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")): words = [x.surface for x in dtokens] if "".join("".join(words).split()) != "".join(text.split()): raise ValueError(Errors.E194.format(text=text, words=words)) - - text_dtokens = [] + text_words = [] + text_lemmas = [] + text_tags = [] text_spaces = [] text_pos = 0 # handle empty and whitespace-only texts if len(words) == 0: - return text_dtokens, text_spaces + return text_words, text_lemmas, text_tags, text_spaces elif len([word for word in words if not word.isspace()]) == 0: assert text.isspace() - text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)] + text_words = [text] + text_lemmas = [text] + text_tags = [gap_tag] text_spaces = [False] - return text_dtokens, text_spaces - - # align words and dtokens by referring text, and insert gap tokens for the space char spans - for word, dtoken in zip(words, dtokens): - # skip all space tokens - if word.isspace(): - continue + return text_words, text_lemmas, text_tags, text_spaces + # normalize words to remove all whitespace tokens + norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()]) + # align words with text + for word, dtoken in zip(norm_words, norm_dtokens): try: word_start = text[text_pos:].index(word) except ValueError: raise ValueError(Errors.E194.format(text=text, words=words)) - - # space token if word_start > 0: w = text[text_pos:text_pos + word_start] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + text_words.append(w) + text_lemmas.append(w) + text_tags.append(gap_tag) text_spaces.append(False) text_pos += word_start - - # content word - text_dtokens.append(dtoken) + text_words.append(word) + text_lemmas.append(dtoken.lemma) + text_tags.append(dtoken.pos) text_spaces.append(False) text_pos += len(word) - # poll a space char after the word if text_pos < len(text) and text[text_pos] == " ": text_spaces[-1] = True text_pos += 1 - - # trailing space token if text_pos < len(text): w = text[text_pos:] - text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None)) + text_words.append(w) + text_lemmas.append(w) + text_tags.append(gap_tag) text_spaces.append(False) - - return text_dtokens, text_spaces + return text_words, text_lemmas, text_tags, text_spaces class JapaneseTokenizer(DummyTokenizer): @@ -139,78 +191,29 @@ class JapaneseTokenizer(DummyTokenizer): self.tokenizer = try_sudachi_import(self.split_mode) def __call__(self, text): - # convert sudachipy.morpheme.Morpheme to DetailedToken and merge continuous spaces - sudachipy_tokens = self.tokenizer.tokenize(text) - dtokens = self._get_dtokens(sudachipy_tokens) - dtokens, spaces = get_dtokens_and_spaces(dtokens, text) + dtokens = get_dtokens(self.tokenizer, text) - # create Doc with tag bi-gram based part-of-speech identification rules - words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6 - sub_tokens_list = list(sub_tokens_list) + words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text) doc = Doc(self.vocab, words=words, spaces=spaces) - next_pos = None # for bi-gram rules - for idx, (token, dtoken) in enumerate(zip(doc, dtokens)): - token.tag_ = dtoken.tag - if next_pos: # already identified in previous iteration + next_pos = None + for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)): + token.tag_ = unidic_tag[0] + if next_pos: token.pos = 
next_pos next_pos = None else: token.pos, next_pos = resolve_pos( token.orth_, - dtoken.tag, - tags[idx + 1] if idx + 1 < len(tags) else None + unidic_tag, + unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None ) - # if there's no lemma info (it's an unk) just use the surface - token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface - doc.user_data["inflections"] = inflections - doc.user_data["reading_forms"] = readings - doc.user_data["sub_tokens"] = sub_tokens_list + # if there's no lemma info (it's an unk) just use the surface + token.lemma_ = lemma + doc.user_data["unidic_tags"] = unidic_tags return doc - def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True): - sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None - dtokens = [ - DetailedToken( - token.surface(), # orth - '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']), # tag - ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']), # inf - token.dictionary_form(), # lemma - token.reading_form(), # user_data['reading_forms'] - sub_tokens_list[idx] if sub_tokens_list else None, # user_data['sub_tokens'] - ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0 - # remove empty tokens which can be produced with characters like … that - ] - # Sudachi normalizes internally and outputs each space char as a token. - # This is the preparation for get_dtokens_and_spaces() to merge the continuous space tokens - return [ - t for idx, t in enumerate(dtokens) if - idx == 0 or - not t.surface.isspace() or t.tag != '空白' or - not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白' - ] - - def _get_sub_tokens(self, sudachipy_tokens): - if self.split_mode is None or self.split_mode == "A": # do nothing for default split mode - return None - - sub_tokens_list = [] # list of (list of list of DetailedToken | None) - for token in sudachipy_tokens: - sub_a = token.split(self.tokenizer.SplitMode.A) - if len(sub_a) == 1: # no sub tokens - sub_tokens_list.append(None) - elif self.split_mode == "B": - sub_tokens_list.append([self._get_dtokens(sub_a, False)]) - else: # "C" - sub_b = token.split(self.tokenizer.SplitMode.B) - if len(sub_a) == len(sub_b): - dtokens = self._get_dtokens(sub_a, False) - sub_tokens_list.append([dtokens, dtokens]) - else: - sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)]) - return sub_tokens_list - def _get_config(self): config = OrderedDict( ( diff --git a/spacy/lang/ja/bunsetu.py b/spacy/lang/ja/bunsetu.py new file mode 100644 index 000000000..7c3eee336 --- /dev/null +++ b/spacy/lang/ja/bunsetu.py @@ -0,0 +1,144 @@ +# coding: utf8 +from __future__ import unicode_literals + +from .stop_words import STOP_WORDS + + +POS_PHRASE_MAP = { + "NOUN": "NP", + "NUM": "NP", + "PRON": "NP", + "PROPN": "NP", + + "VERB": "VP", + + "ADJ": "ADJP", + + "ADV": "ADVP", + + "CCONJ": "CCONJP", +} + + +# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)] +def yield_bunsetu(doc, debug=False): + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + prev = None + prev_tag = None + prev_dep = None + prev_head = None + for t in doc: + pos = t.pos_ + pos_type = POS_PHRASE_MAP.get(pos, None) + tag = t.tag_ + dep = t.dep_ + head = t.head.i + if debug: + print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu) + + # DET is always an individual bunsetu + if pos == "DET": + if bunsetu: + yield bunsetu, 
phrase_type, phrase + yield [t], None, None + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + + # PRON or Open PUNCT always splits bunsetu + elif tag == "補助記号-括弧開": + if bunsetu: + yield bunsetu, phrase_type, phrase + bunsetu = [t] + bunsetu_may_end = True + phrase_type = None + phrase = None + + # bunsetu head not appeared + elif phrase_type is None: + if bunsetu and prev_tag == "補助記号-読点": + yield bunsetu, phrase_type, phrase + bunsetu = [] + bunsetu_may_end = False + phrase_type = None + phrase = None + bunsetu.append(t) + if pos_type: # begin phrase + phrase = [t] + phrase_type = pos_type + if pos_type in {"ADVP", "CCONJP"}: + bunsetu_may_end = True + + # entering new bunsetu + elif pos_type and ( + pos_type != phrase_type or # different phrase type arises + bunsetu_may_end # same phrase type but bunsetu already ended + ): + # exceptional case: NOUN to VERB + if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i: + bunsetu.append(t) + phrase_type = "VP" + phrase.append(t) + # exceptional case: VERB to NOUN + elif phrase_type == "VP" and pos_type == "NP" and ( + prev_dep == 'compound' and prev_head == t.i or + dep == 'compound' and prev == head or + prev_dep == 'nmod' and prev_head == t.i + ): + bunsetu.append(t) + phrase_type = "NP" + phrase.append(t) + else: + yield bunsetu, phrase_type, phrase + bunsetu = [t] + bunsetu_may_end = False + phrase_type = pos_type + phrase = [t] + + # NOUN bunsetu + elif phrase_type == "NP": + bunsetu.append(t) + if not bunsetu_may_end and (( + (pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'} + ) or ( + pos == "PART" and (prev == head or prev_head == head) and dep == 'mark' + )): + phrase.append(t) + else: + bunsetu_may_end = True + + # VERB bunsetu + elif phrase_type == "VP": + bunsetu.append(t) + if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound': + phrase.append(t) + else: + bunsetu_may_end = True + + # ADJ bunsetu + elif phrase_type == "ADJP" and tag != '連体詞': + bunsetu.append(t) + if not bunsetu_may_end and (( + pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'} + ) or ( + pos == "PART" and (prev == head or prev_head == head) and dep == 'mark' + )): + phrase.append(t) + else: + bunsetu_may_end = True + + # other bunsetu + else: + bunsetu.append(t) + + prev = t.i + prev_tag = t.tag_ + prev_dep = t.dep_ + prev_head = head + + if bunsetu: + yield bunsetu, phrase_type, phrase diff --git a/spacy/lang/ne/__init__.py b/spacy/lang/ne/__init__.py deleted file mode 100644 index 21556277d..000000000 --- a/spacy/lang/ne/__init__.py +++ /dev/null @@ -1,23 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from .stop_words import STOP_WORDS -from .lex_attrs import LEX_ATTRS - -from ...language import Language -from ...attrs import LANG - - -class NepaliDefaults(Language.Defaults): - lex_attr_getters = dict(Language.Defaults.lex_attr_getters) - lex_attr_getters.update(LEX_ATTRS) - lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code - stop_words = STOP_WORDS - - -class Nepali(Language): - lang = "ne" - Defaults = NepaliDefaults - - -__all__ = ["Nepali"] diff --git a/spacy/lang/ne/examples.py b/spacy/lang/ne/examples.py deleted file mode 100644 index b3c4f9e73..000000000 --- a/spacy/lang/ne/examples.py +++ /dev/null @@ -1,22 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -""" -Example sentences to 
test spaCy and its language models. - ->>> from spacy.lang.ne.examples import sentences ->>> docs = nlp.pipe(sentences) -""" - - -sentences = [ - "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ", - "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्", - "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ", - "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।", - "तिमी कहाँ छौ?", - "फ्रान्स को राष्ट्रपति को हो?", - "संयुक्त राज्यको राजधानी के हो?", - "बराक ओबामा कहिले कहिले जन्मेका हुन्?", -] diff --git a/spacy/lang/ne/lex_attrs.py b/spacy/lang/ne/lex_attrs.py deleted file mode 100644 index 652307577..000000000 --- a/spacy/lang/ne/lex_attrs.py +++ /dev/null @@ -1,98 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - -from ..norm_exceptions import BASE_NORMS -from ...attrs import NORM, LIKE_NUM - - -# fmt: off -_stem_suffixes = [ - ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"], - ["ँ", "ं", "्", "ः"], - ["लाई", "ले", "बाट", "को", "मा", "हरू"], - ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"], - ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"], - ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"], - ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"], - ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"], - ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"], - ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"], - ["याइ", "ाइ", "बार", "वार", "चाँहि"], - ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"] -] -# fmt: on - -# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language -# reference 2: https://www.imnepal.com/nepali-numbers/ -_num_words = [ - "शुन्य", - "एक", - "दुई", - "तीन", - "चार", - "पाँच", - "छ", - "सात", - "आठ", - "नौ", - "दश", - "एघार", - "बाह्र", - "तेह्र", - "चौध", - "पन्ध्र", - "सोह्र", - "सोह्र", - "सत्र", - "अठार", - "उन्नाइस", - "बीस", - "तीस", - "चालीस", - "पचास", - "साठी", - "सत्तरी", - "असी", - "नब्बे", - "सय", - "हजार", - "लाख", - "करोड", - "अर्ब", - "खर्ब", -] - - -def norm(string): - # normalise base exceptions, e.g. 
punctuation or currency symbols - if string in BASE_NORMS: - return BASE_NORMS[string] - # set stem word as norm, if available, adapted from: - # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py - # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar - for suffix_group in reversed(_stem_suffixes): - length = len(suffix_group[0]) - if len(string) <= length: - break - for suffix in suffix_group: - if string.endswith(suffix): - return string[:-length] - return string - - -def like_num(text): - if text.startswith(("+", "-", "±", "~")): - text = text[1:] - text = text.replace(", ", "").replace(".", "") - if text.isdigit(): - return True - if text.count("/") == 1: - num, denom = text.split("/") - if num.isdigit() and denom.isdigit(): - return True - if text.lower() in _num_words: - return True - return False - - -LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num} diff --git a/spacy/lang/ne/stop_words.py b/spacy/lang/ne/stop_words.py deleted file mode 100644 index f008697d0..000000000 --- a/spacy/lang/ne/stop_words.py +++ /dev/null @@ -1,498 +0,0 @@ -# coding: utf8 -from __future__ import unicode_literals - - -# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt - -STOP_WORDS = set( - """ -अक्सर -अगाडि -अगाडी -अघि -अझै -अठार -अथवा -अनि -अनुसार -अन्तर्गत -अन्य -अन्यत्र -अन्यथा -अब -अरु -अरुलाई -अरू -अर्को -अर्थात -अर्थात् -अलग -अलि -अवस्था -अहिले -आए -आएका -आएको -आज -आजको -आठ -आत्म -आदि -आदिलाई -आफनो -आफू -आफूलाई -आफै -आफैँ -आफ्नै -आफ्नो -आयो -उ -उक्त -उदाहरण -उनको -उनलाई -उनले -उनि -उनी -उनीहरुको -उन्नाइस -उप -उसको -उसलाई -उसले -उहालाई -ऊ -एउटा -एउटै -एक -एकदम -एघार -ओठ -औ -औं -कता -कति -कतै -कम -कमसेकम -कसरि -कसरी -कसै -कसैको -कसैलाई -कसैले -कसैसँग -कस्तो -कहाँबाट -कहिलेकाहीं -का -काम -कारण -कि -किन -किनभने -कुन -कुनै -कुन्नी -कुरा -कृपया -के -केहि -केही -को -कोहि -कोहिपनि -कोही -कोहीपनि -क्रमशः -गए -गएको -गएर -गयौ -गरि -गरी -गरे -गरेका -गरेको -गरेर -गरौं -गर्छ -गर्छन् -गर्छु -गर्दा -गर्दै -गर्न -गर्नु -गर्नुपर्छ -गर्ने -गैर -घर -चार -चाले -चाहनुहुन्छ -चाहन्छु -चाहिं -चाहिए -चाहिंले -चाहीं -चाहेको -चाहेर -चोटी -चौथो -चौध -छ -छन -छन् -छु -छू -छैन -छैनन् -छौ -छौं -जता -जताततै -जना -जनाको -जनालाई -जनाले -जब -जबकि -जबकी -जसको -जसबाट -जसमा -जसरी -जसलाई -जसले -जस्ता -जस्तै -जस्तो -जस्तोसुकै -जहाँ -जान -जाने -जाहिर -जुन -जुनै -जे -जो -जोपनि -जोपनी -झैं -ठाउँमा -ठीक -ठूलो -त -तता -तत्काल -तथा -तथापि -तथापी -तदनुसार -तपाइ -तपाई -तपाईको -तब -तर -तर्फ -तल -तसरी -तापनि -तापनी -तिन -तिनि -तिनिहरुलाई -तिनी -तिनीहरु -तिनीहरुको -तिनीहरू -तिनीहरूको -तिनै -तिमी -तिर -तिरको -ती -तीन -तुरन्त -तुरुन्त -तुरुन्तै -तेश्रो -तेस्कारण -तेस्रो -तेह्र -तैपनि -तैपनी -त्यत्तिकै -त्यत्तिकैमा -त्यस -त्यसकारण -त्यसको -त्यसले -त्यसैले -त्यसो -त्यस्तै -त्यस्तो -त्यहाँ -त्यहिँ -त्यही -त्यहीँ -त्यहीं -त्यो -त्सपछि -त्सैले -थप -थरि -थरी -थाहा -थिए -थिएँ -थिएन -थियो -दर्ता -दश -दिए -दिएको -दिन -दिनुभएको -दिनुहुन्छ -दुइ -दुइवटा -दुई -देखि -देखिन्छ -देखियो -देखे -देखेको -देखेर -दोश्री -दोश्रो -दोस्रो -द्वारा -धन्न -धेरै -धौ -न -नगर्नु -नगर्नू -नजिकै -नत्र -नत्रभने -नभई -नभएको -नभनेर -नयाँ -नि -निकै -निम्ति -निम्न -निम्नानुसार -निर्दिष्ट -नै -नौ -पक्का -पक्कै -पछाडि -पछाडी -पछि -पछिल्लो -पछी -पटक -पनि -पन्ध्र -पर्छ -पर्थ्यो -पर्दैन -पर्ने -पर्नेमा -पर्याप्त -पहिले -पहिलो -पहिल्यै -पाँच -पांच -पाचौँ -पाँचौं -पिच्छे -पूर्व -पो -प्रति -प्रतेक -प्रत्यक -प्राय -प्लस -फरक -फेरि -फेरी -बढी -बताए -बने -बरु -बाट -बारे -बाहिर -बाहेक -बाह्र -बिच -बिचमा -बिरुद्ध -बिशेष -बिस -बीच -बीचमा -बीस -भए -भएँ -भएका -भएकालाई -भएको -भएन -भएर -भन -भने -भनेको -भनेर -भन् -भन्छन् 
-भन्छु -भन्दा -भन्दै -भन्नुभयो -भन्ने -भन्या -भयेन -भयो -भर -भरि -भरी -भा -भित्र -भित्री -भीत्र -म -मध्य -मध्ये -मलाई -मा -मात्र -मात्रै -माथि -माथी -मुख्य -मुनि -मुन्तिर -मेरो -मैले -यति -यथोचित -यदि -यद्ध्यपि -यद्यपि -यस -यसका -यसको -यसपछि -यसबाहेक -यसमा -यसरी -यसले -यसो -यस्तै -यस्तो -यहाँ -यहाँसम्म -यही -या -यी -यो -र -रही -रहेका -रहेको -रहेछ -राखे -राख्छ -राम्रो -रुपमा -रूप -रे -लगभग -लगायत -लाई -लाख -लागि -लागेको -ले -वटा -वरीपरी -वा -वाट -वापत -वास्तवमा -शायद -सक्छ -सक्ने -सँग -संग -सँगको -सँगसँगै -सँगै -संगै -सङ्ग -सङ्गको -सट्टा -सत्र -सधै -सबै -सबैको -सबैलाई -समय -समेत -सम्भव -सम्म -सय -सरह -सहित -सहितै -सही -साँच्चै -सात -साथ -साथै -सायद -सारा -सुनेको -सुनेर -सुरु -सुरुको -सुरुमै -सो -सोचेको -सोचेर -सोही -सोह्र -स्थित -स्पष्ट -हजार -हरे -हरेक -हामी -हामीले -हाम्रा -हाम्रो -हुँदैन -हुन -हुनत -हुनु -हुने -हुनेछ -हुन् -हुन्छ -हुन्थ्यो -हैन -हो -होइन -होकि -होला -""".split() -) diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx index 8042098d7..1df516dcb 100644 --- a/spacy/lexeme.pyx +++ b/spacy/lexeme.pyx @@ -349,7 +349,7 @@ cdef class Lexeme: @property def is_oov(self): """RETURNS (bool): Whether the lexeme is out-of-vocabulary.""" - return self.orth not in self.vocab.vectors + return self.orth in self.vocab.vectors property is_stop: """RETURNS (bool): Whether the lexeme is a stop word.""" diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx index 8f07bf8f7..3f40cb545 100644 --- a/spacy/pipeline/pipes.pyx +++ b/spacy/pipeline/pipes.pyx @@ -528,10 +528,10 @@ class Tagger(Pipe): new_tag_map[tag] = orig_tag_map[tag] else: new_tag_map[tag] = {POS: X} + if "_SP" in orig_tag_map: + new_tag_map["_SP"] = orig_tag_map["_SP"] cdef Vocab vocab = self.vocab if new_tag_map: - if "_SP" in orig_tag_map: - new_tag_map["_SP"] = orig_tag_map["_SP"] vocab.morphology = Morphology(vocab.strings, new_tag_map, vocab.morphology.lemmatizer, exc=vocab.morphology.exc) diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py index 91b7e4d9d..1f13da5d6 100644 --- a/spacy/tests/conftest.py +++ b/spacy/tests/conftest.py @@ -170,11 +170,6 @@ def nb_tokenizer(): return get_lang_class("nb").Defaults.create_tokenizer() -@pytest.fixture(scope="session") -def ne_tokenizer(): - return get_lang_class("ne").Defaults.create_tokenizer() - - @pytest.fixture(scope="session") def nl_tokenizer(): return get_lang_class("nl").Defaults.create_tokenizer() diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py index 651e906eb..26be5cf59 100644 --- a/spacy/tests/lang/ja/test_tokenizer.py +++ b/spacy/tests/lang/ja/test_tokenizer.py @@ -4,7 +4,7 @@ from __future__ import unicode_literals import pytest from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS -from spacy.lang.ja import Japanese, DetailedToken +from spacy.lang.ja import Japanese # fmt: off TOKENIZER_TESTS = [ @@ -96,57 +96,6 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c): assert len(nlp_c(text)) == len_c -@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c", - [ - ( - "選挙管理委員会", - [None, None, None, None], - [None, None, [ - [ - DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ] - ]], - [[ - [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - 
DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None), - DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None), - ], [ - DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None), - DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None), - DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None), - ] - ]] - ), - ] -) -def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c): - nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}}) - nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}}) - nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}}) - - assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a - assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a - assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b - assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c - - -@pytest.mark.parametrize("text,inflections,reading_forms", - [ - ( - "取ってつけた", - ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"), - ("トッ", "テ", "ツケ", "タ"), - ), - ] -) -def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms): - assert ja_tokenizer(text).user_data["inflections"] == inflections - assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms - - def test_ja_tokenizer_emptyish_texts(ja_tokenizer): doc = ja_tokenizer("") assert len(doc) == 0 diff --git a/spacy/tests/lang/ne/__init__.py b/spacy/tests/lang/ne/__init__.py deleted file mode 100644 index e69de29bb..000000000 diff --git a/spacy/tests/lang/ne/test_text.py b/spacy/tests/lang/ne/test_text.py deleted file mode 100644 index 926a7de04..000000000 --- a/spacy/tests/lang/ne/test_text.py +++ /dev/null @@ -1,19 +0,0 @@ -# coding: utf-8 -from __future__ import unicode_literals - -import pytest - - -def test_ne_tokenizer_handlers_long_text(ne_tokenizer): - text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।""" - tokens = ne_tokenizer(text) - assert len(tokens) == 24 - - -@pytest.mark.parametrize( - "text,length", - [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)], -) -def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length): - tokens = ne_tokenizer(text) - assert len(tokens) == length \ No newline at end of file diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py index 1681ffeaa..a5bda9090 100644 --- a/spacy/tests/pipeline/test_tagger.py +++ b/spacy/tests/pipeline/test_tagger.py @@ -3,7 +3,6 @@ from __future__ import unicode_literals import pytest from spacy.language import Language -from spacy.symbols import POS, NOUN def test_label_types(): @@ -12,16 +11,3 @@ def test_label_types(): nlp.get_pipe("tagger").add_label("A") with pytest.raises(ValueError): nlp.get_pipe("tagger").add_label(9) - - -def test_tagger_begin_training_tag_map(): - """Test that Tagger.begin_training() without gold tuples does not clobber - the tag map.""" - nlp = Language() - tagger = nlp.create_pipe("tagger") - orig_tag_count = len(tagger.labels) - tagger.add_label("A", {"POS": "NOUN"}) - nlp.add_pipe(tagger) - nlp.begin_training() - assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN} - assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels) 
diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py index b31cef1f2..576ca93d2 100644 --- a/spacy/tests/vocab_vectors/test_vectors.py +++ b/spacy/tests/vocab_vectors/test_vectors.py @@ -376,6 +376,6 @@ def test_vector_is_oov(): data[1] = 2.0 vocab.set_vector("cat", data[0]) vocab.set_vector("dog", data[1]) - assert vocab["cat"].is_oov is False - assert vocab["dog"].is_oov is False - assert vocab["hamster"].is_oov is True + assert vocab["cat"].is_oov is True + assert vocab["dog"].is_oov is True + assert vocab["hamster"].is_oov is False diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx index 8d3406bae..45deebc93 100644 --- a/spacy/tokens/token.pyx +++ b/spacy/tokens/token.pyx @@ -923,7 +923,7 @@ cdef class Token: @property def is_oov(self): """RETURNS (bool): Whether the token is out-of-vocabulary.""" - return self.c.lex.orth not in self.vocab.vectors + return self.c.lex.orth in self.vocab.vectors @property def is_stop(self): diff --git a/spacy/util.py b/spacy/util.py index 923f56b31..5362952e2 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -208,10 +208,6 @@ def load_model_from_path(model_path, meta=False, **overrides): pipeline = nlp.Defaults.pipe_names elif pipeline in (False, None): pipeline = [] - # skip "vocab" from overrides in component initialization since vocab is - # already configured from overrides when nlp is initialized above - if "vocab" in overrides: - del overrides["vocab"] for name in pipeline: if name not in disable: config = meta.get("pipeline_args", {}).get(name, {}) diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md index bc33dd4e6..5df625991 100644 --- a/website/docs/api/goldparse.md +++ b/website/docs/api/goldparse.md @@ -12,18 +12,18 @@ expects true examples of a label to have the value `1.0`, and negative examples of a label to have the value `0.0`. Labels not in the dictionary are treated as missing – the gradient for those labels will be zero. -| Name | Type | Description | -| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `doc` | `Doc` | The document the annotations refer to. | -| `words` | iterable | A sequence of unicode word strings. | -| `tags` | iterable | A sequence of strings, representing tag annotations. | -| `heads` | iterable | A sequence of integers, representing syntactic head offsets. | -| `deps` | iterable | A sequence of strings, representing the syntactic relation types. | -| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | -| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | -| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | -| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. | -| **RETURNS** | `GoldParse` | The newly constructed object. 
| +| Name | Type | Description | +| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `doc` | `Doc` | The document the annotations refer to. | +| `words` | iterable | A sequence of unicode word strings. | +| `tags` | iterable | A sequence of strings, representing tag annotations. | +| `heads` | iterable | A sequence of integers, representing syntactic head offsets. | +| `deps` | iterable | A sequence of strings, representing the syntactic relation types. | +| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. | +| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). | +| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). | +| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False.`. | +| **RETURNS** | `GoldParse` | The newly constructed object. | ## GoldParse.\_\_len\_\_ {#len tag="method"} @@ -43,17 +43,17 @@ Whether the provided syntactic annotations form a projective dependency tree. ## Attributes {#attributes} -| Name | Type | Description | -| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ | -| `words` | list | The words. | -| `tags` | list | The part-of-speech tag annotations. | -| `heads` | list | The syntactic head annotations. | -| `labels` | list | The syntactic relation-type annotations. | -| `ner` | list | The named entity annotations as BILUO tags. | -| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | -| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | -| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | -| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. | +| Name | Type | Description | +| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `words` | list | The words. | +| `tags` | list | The part-of-speech tag annotations. | +| `heads` | list | The syntactic head annotations. | +| `labels` | list | The syntactic relation-type annotations. | +| `ner` | list | The named entity annotations as BILUO tags. | +| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. | +| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. | +| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. | +| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. 
| ## Utilities {#util} @@ -61,8 +61,7 @@ Whether the provided syntactic annotations form a projective dependency tree. Convert a list of Doc objects into the [JSON-serializable format](/api/annotation#json-input) used by the -[`spacy train`](/api/cli#train) command. Each input doc will be treated as a -'paragraph' in the output doc. +[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc. > #### Example > diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md index 7b195e352..ac2f898e0 100644 --- a/website/docs/api/matcher.md +++ b/website/docs/api/matcher.md @@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects. | Name | Type | Description | | ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). | +| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).. | | **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end`]. The `match_id` is the ID of the added match pattern. | diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md index 1e8960edf..1a438e424 100644 --- a/website/docs/usage/101/_pos-deps.md +++ b/website/docs/usage/101/_pos-deps.md @@ -36,7 +36,7 @@ for token in doc: | Text | Lemma | POS | Tag | Dep | Shape | alpha | stop | | ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- | | Apple | apple | `PROPN` | `NNP` | `nsubj` | `Xxxxx` | `True` | `False` | -| is | be | `AUX` | `VBZ` | `aux` | `xx` | `True` | `True` | +| is | be | `VERB` | `VBZ` | `aux` | `xx` | `True` | `True` | | looking | look | `VERB` | `VBG` | `ROOT` | `xxxx` | `True` | `False` | | at | at | `ADP` | `IN` | `prep` | `xx` | `True` | `True` | | buying | buy | `VERB` | `VBG` | `pcomp` | `xxxx` | `True` | `False` | diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md index 29a9a1c27..d42aad705 100644 --- a/website/docs/usage/adding-languages.md +++ b/website/docs/usage/adding-languages.md @@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole documents**, not just single sentences. If your corpus only contains single sentences, spaCy's models will never learn to expect multi-sentence documents, leading to low performance on real text. To mitigate this problem, you can use -the `-n` argument to the `spacy convert` command, to merge some of the sentences +the `-N` argument to the `spacy convert` command, to merge some of the sentences into longer pseudo-documents. 
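The sentence-merging behaviour described above is easiest to see with a concrete call. The following is an illustrative sketch only: the file names are placeholders, and it assumes a spaCy v2.x install where `spacy.cli.convert` mirrors the command-line converter (including its `n_sents` option) and the output directory already exists.

```python
# Illustrative sketch: paths are placeholders, output directory must already exist.
from spacy.cli import convert

# Merge every 10 sentences of a CoNLL-U corpus into one pseudo-document so that
# the tagger and parser are trained on realistic multi-sentence inputs.
convert("train.conllu", "corpus", converter="conllu", n_sents=10)
```

Each resulting pseudo-document then spans ten sentences instead of one, which is the effect the option above is meant to achieve.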
### Training the tagger and parser {#train-tagger-parser} diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 9031a356f..84bb3d71b 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.") print("Before", doc.ents) # [] header = [ENT_IOB, ENT_TYPE] -attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64") +attr_array = numpy.zeros((len(doc), len(header))) attr_array[0, 0] = 3 # B attr_array[0, 1] = doc.vocab.strings["GPE"] doc.from_array(header, attr_array) @@ -1143,9 +1143,9 @@ from spacy.gold import align other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."] spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."] cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens) -print("Edit distance:", cost) # 3 +print("Misaligned tokens:", cost) # 2 print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6]) -print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7]) +print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, 5, 6, 7]) print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4} print("Many-to-one mappings b-> a", b2a_multi) # {} ``` @@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi) # {} Here are some insights from the alignment information generated in the example above: -- The edit distance (cost) is `3`: two deletions and one insertion. +- Two tokens are misaligned. - The one-to-one mappings for the first four tokens are identical, which means they map to each other. This makes sense because they're also identical in the input: `"i"`, `"listened"`, `"to"` and `"obama"`. diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md index b11e6347a..382193157 100644 --- a/website/docs/usage/models.md +++ b/website/docs/usage/models.md @@ -117,18 +117,6 @@ The Chinese language class supports three word segmentation options: better segmentation for Chinese OntoNotes and the new [Chinese models](/models/zh). - - -Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship -with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can -install it from our fork and compile it locally: - -```bash -$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip -``` - - - The `meta` argument of the `Chinese` language class supports the following @@ -208,20 +196,12 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo The Japanese language class uses [SudachiPy](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. The default Japanese language class and -the provided Japanese models use SudachiPy split mode `A`. +segmentation and part-of-speech tagging. The default Japanese language class +and the provided Japanese models use SudachiPy split mode `A`. The `meta` argument of the `Japanese` language class can be used to configure the split mode to `A`, `B` or `C`. - - -If you run into errors related to `sudachipy`, which is currently under active -development, we suggest downgrading to `sudachipy==0.4.5`, which is the version -used for training the current [Japanese models](/models/ja). 
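To illustrate the `meta`-based split mode configuration discussed above, here is a minimal sketch. It assumes SudachiPy is installed (for example via `pip install spacy[ja]`) and that the v2.3-era Japanese tokenizer reads a `split_mode` key from its tokenizer config; treat that key name as an assumption rather than a guaranteed API.

```python
# Minimal sketch: requires SudachiPy (pip install spacy[ja]); "split_mode" is
# assumed to be the config key used by the v2.3 Japanese tokenizer.
from spacy.lang.ja import Japanese

nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
doc = nlp("選挙管理委員会")
print([token.text for token in doc])  # mode B yields coarser tokens than mode A
```

In SudachiPy, mode `A` produces the shortest units, `C` the longest, and `B` sits in between.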
- - - ## Installing and using models {#download} > #### Downloading models in spaCy < v1.7 diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index f7866fe31..1db2405d1 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -1158,17 +1158,17 @@ what you need for your application. > available corpus. For example, the corpus spaCy's [English models](/models/en) were trained on -defines a `PERSON` entity as just the **person name**, without titles like "Mr." -or "Dr.". This makes sense, because it makes it easier to resolve the entity -type back to a knowledge base. But what if your application needs the full -names, _including_ the titles? +defines a `PERSON` entity as just the **person name**, without titles like "Mr" +or "Dr". This makes sense, because it makes it easier to resolve the entity type +back to a knowledge base. But what if your application needs the full names, +_including_ the titles? ```python ### {executable="true"} import spacy nlp = spacy.load("en_core_web_sm") -doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.") +doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` @@ -1233,7 +1233,7 @@ def expand_person_entities(doc): # Add the component after the named entity recognizer nlp.add_pipe(expand_person_entities, after='ner') -doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.") +doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.") print([(ent.text, ent.label_) for ent in doc.ents]) ``` diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md index e6b88c779..ba75b01ab 100644 --- a/website/docs/usage/v2-3.md +++ b/website/docs/usage/v2-3.md @@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish and Romanian** and updated the training data and vectors for most languages. Model packages with vectors are about **2×** smaller on disk and load -**2-4×** faster. For the full changelog, see the -[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). -For more details and a behind-the-scenes look at the new release, -[see our blog post](https://explosion.ai/blog/spacy-v2-3). +**2-4×** faster. For the full changelog, see the [release notes on +GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more +details and a behind-the-scenes look at the new release, [see our blog +post](https://explosion.ai/blog/spacy-v2-3). ### Expanded model families with vectors {#models} @@ -33,10 +33,10 @@ For more details and a behind-the-scenes look at the new release, With new model families for Chinese, Danish, Polish, Romanian and Chinese plus `md` and `lg` models with word vectors for all languages, this release provides -a total of 46 model packages. For models trained using -[Universal Dependencies](https://universaldependencies.org) corpora, the -training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) -and Dutch has been extended to include both UD Dutch Alpino and LassySmall. +a total of 46 model packages. For models trained using [Universal +Dependencies](https://universaldependencies.org) corpora, the training data has +been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been +extended to include both UD Dutch Alpino and LassySmall. 
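Since the paragraph above centers on the new `md` and `lg` packages with word vectors, a short usage sketch may help. It assumes `en_core_web_md` is installed and is only meant to show that similarity and vector lookups in these packages are backed by real word vectors; exact scores depend on the model version.

```python
# Sketch assuming the en_core_web_md package is installed.
import spacy

nlp = spacy.load("en_core_web_md")
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1.similarity(doc2))    # similarity score from the packaged word vectors
print(nlp.vocab.vectors.shape)  # (number of vectors, vector width)
```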
@@ -48,7 +48,6 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall. ### Chinese {#chinese} > #### Example -> > ```python > from spacy.lang.zh import Chinese > @@ -58,49 +57,41 @@ and Dutch has been extended to include both UD Dutch Alpino and LassySmall. > > # Append words to user dict > nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"]) -> ``` This release adds support for -[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and -the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The -Chinese tokenizer can be initialized with both `pkuseg` and custom models and -the `pkuseg` user dictionary is easy to customize. Note that -[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with -pre-compiled wheels for Python 3.8. See the -[usage documentation](/usage/models#chinese) for details on how to install it on -Python 3.8. +[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and +the new Chinese models ship with a custom pkuseg model trained on OntoNotes. +The Chinese tokenizer can be initialized with both `pkuseg` and custom models +and the `pkuseg` user dictionary is easy to customize. -**Models:** [Chinese models](/models/zh) **Usage: ** -[Chinese tokenizer usage](/usage/models#chinese) +**Chinese:** [Chinese tokenizer usage](/usage/models#chinese) ### Japanese {#japanese} The updated Japanese language class switches to -[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word -segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies +[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word +segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies installing spaCy for Japanese, which is now possible with a single command: `pip install spacy[ja]`. -**Models:** [Japanese models](/models/ja) **Usage:** -[Japanese tokenizer usage](/usage/models#japanese) +**Japanese:** [Japanese tokenizer usage](/usage/models#japanese) ### Small CLI updates -- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors - in a base model with `spacy debug-data lang train dev -b base_model` -- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g. - `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy - without loading a model -- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to - the first iteration +- `spacy debug-data` provides the coverage of the vectors in a base model with + `spacy debug-data lang train dev -b base_model` +- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en + dev.json`) to evaluate the tokenization accuracy without loading a model +- `spacy train` on GPU restricts the CPU timing evaluation to the first + iteration ## Backwards incompatibilities {#incompat} @@ -109,8 +100,8 @@ installing spaCy for Japanese, which is now possible with a single command: If you've been training **your own models**, you'll need to **retrain** them with the new version. Also don't forget to upgrade all models to the latest versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible -with models for v2.3. To check if all of your models are up to date, you can run -the [`spacy validate`](/api/cli#validate) command. +with models for v2.3. To check if all of your models are up to date, you can +run the [`spacy validate`](/api/cli#validate) command. @@ -125,20 +116,21 @@ the [`spacy validate`](/api/cli#validate) command. 
> directly. - If you're training new models, you'll want to install the package - [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which - now includes both the lemmatization tables (as in v2.2) and the normalization - tables (new in v2.3). If you're using pretrained models, **nothing changes**, - because the relevant tables are included in the model packages. + [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), + which now includes both the lemmatization tables (as in v2.2) and the + normalization tables (new in v2.3). If you're using pretrained models, + **nothing changes**, because the relevant tables are included in the model + packages. - Due to the updated Universal Dependencies training data, the fine-grained part-of-speech tags will change for many provided language models. The coarse-grained part-of-speech tagset remains the same, but the mapping from particular fine-grained to coarse-grained tags may show minor differences. - For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech - tagsets contain new merged tags related to contracted forms, such as `ADP_DET` - for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This - increases the accuracy of the models by improving the alignment between - spaCy's tokenization and Universal Dependencies multi-word tokens used for - contractions. + tagsets contain new merged tags related to contracted forms, such as + `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head + `"à"`. This increases the accuracy of the models by improving the alignment + between spaCy's tokenization and Universal Dependencies multi-word tokens + used for contractions. ### Migrating from spaCy 2.2 {#migrating} @@ -151,81 +143,29 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1 and earlier versions. A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle -cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a -comma at the end of a URL) before applying the match. See the full -[tokenizer documentation](/usage/linguistic-features#tokenization) and try out +cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., +a comma at the end of a URL) before applying the match. See the full [tokenizer +documentation](/usage/linguistic-features#tokenization) and try out [`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when debugging your tokenizer configuration. #### Warnings configuration -spaCy's custom warnings have been replaced with native Python +spaCy's custom warnings have been replaced with native python [`warnings`](https://docs.python.org/3/library/warnings.html). Instead of -setting `SPACY_WARNING_IGNORE`, use the [`warnings` +setting `SPACY_WARNING_IGNORE`, use the [warnings filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter) to manage warnings. -```diff -import spacy -+ import warnings - -- spacy.errors.SPACY_WARNING_IGNORE.append('W007') -+ warnings.filterwarnings("ignore", message=r"\\[W007\\]", category=UserWarning) -``` - #### Normalization tables The normalization tables have moved from the language data in -[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the -package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). -If you're adding data for a new language, the normalization table should be -added to `spacy-lookups-data`. 
See -[adding norm exceptions](/usage/adding-languages#norm-exceptions). - -#### No preloaded lexemes/vocab for models with vectors - -To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer -loaded on initialization for models with vectors. As you process texts, the -lexemes will be added to the vocab automatically, just as in models without -vectors. - -To see the number of unique vectors and number of words with vectors, see -`nlp.meta['vectors']`, for example for `en_core_web_md` there are `20000` -unique vectors and `684830` words with vectors: - -```python -{ - 'width': 300, - 'vectors': 20000, - 'keys': 684830, - 'name': 'en_core_web_md.vectors' -} -``` - -If required, for instance if you are working directly with word vectors rather -than processing texts, you can load all lexemes for words with vectors at once: - -```python -for orth in nlp.vocab.vectors: - _ = nlp.vocab[orth] -``` - -#### Lexeme.is_oov and Token.is_oov - - - -Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be -fixed in the next patch release v2.3.1. - - - -In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not -have a word vector. This is equivalent to `token.orth not in -nlp.vocab.vectors`. - -Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored -probability and cluster features. The probability and cluster features are no -longer included in the provided medium and large models (see the next section). +[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to +the package +[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If +you're adding data for a new language, the normalization table should be added +to `spacy-lookups-data`. See [adding norm +exceptions](/usage/adding-languages#norm-exceptions). #### Probability and cluster features @@ -241,28 +181,28 @@ longer included in the provided medium and large models (see the next section). The `Token.prob` and `Token.cluster` features, which are no longer used by the core pipeline components as of spaCy v2, are no longer provided in the -pretrained models to reduce the model size. To keep these features available for -users relying on them, the `prob` and `cluster` features for the most frequent -1M tokens have been moved to +pretrained models to reduce the model size. To keep these features available +for users relying on them, the `prob` and `cluster` features for the most +frequent 1M tokens have been moved to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as `extra` features for the relevant languages (English, German, Greek and Spanish). The extra tables are loaded lazily, so if you have `spacy-lookups-data` -installed and your code accesses `Token.prob`, the full table is loaded into the -model vocab, which will take a few seconds on initial loading. When you save -this model after loading the `prob` table, the full `prob` table will be saved -as part of the model vocab. +installed and your code accesses `Token.prob`, the full table is loaded into +the model vocab, which will take a few seconds on initial loading. When you +save this model after loading the `prob` table, the full `prob` table will be +saved as part of the model vocab. 
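A small sketch of the lazy loading described above, assuming spaCy v2.3 with `en_core_web_md` and the `spacy-lookups-data` package installed so that the extra tables are available.

```python
# Sketch: requires en_core_web_md and spacy-lookups-data.
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("The quick brown fox")

# The first access to Token.prob triggers loading of the extra probability table
# into the model vocab (a few seconds), after which lookups are instant.
print(doc[1].prob)     # smoothed log probability as a negative float
print(doc[1].cluster)  # Brown cluster ID, 0 if no cluster is available
```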
-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part -of a new model, add the data to +If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as +part of a new model, add the data to [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a [`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`, `lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is -currently only used to provide a custom `oov_prob`. See examples in the -[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) +currently only used to provide a custom `oov_prob`. See examples in the [`data` +directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data) in `spacy-lookups-data`. #### Initializing new models without extra lookups tables diff --git a/website/meta/site.json b/website/meta/site.json index 8b8424f82..29d71048e 100644 --- a/website/meta/site.json +++ b/website/meta/site.json @@ -23,9 +23,9 @@ "apiKey": "371e26ed49d29a27bd36273dfdaf89af", "indexName": "spacy" }, - "binderUrl": "explosion/spacy-io-binder", + "binderUrl": "ines/spacy-io-binder", "binderBranch": "live", - "binderVersion": "2.3.0", + "binderVersion": "2.2.0", "sections": [ { "id": "usage", "title": "Usage Documentation", "theme": "blue" }, { "id": "models", "title": "Models Documentation", "theme": "blue" },