diff --git a/.github/contributors/PluieElectrique.md b/.github/contributors/PluieElectrique.md
new file mode 100644
index 000000000..97e01650a
--- /dev/null
+++ b/.github/contributors/PluieElectrique.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+ * [X] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Pluie |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2020-06-18 |
+| GitHub username | PluieElectrique |
+| Website (optional) | |
diff --git a/.github/contributors/mahnerak.md b/.github/contributors/mahnerak.md
new file mode 100644
index 000000000..cc7739681
--- /dev/null
+++ b/.github/contributors/mahnerak.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Karen Hambardzumyan |
+| Company name (if applicable) | YerevaNN |
+| Title or role (if applicable) | Researcher |
+| Date | 2020-06-19 |
+| GitHub username | mahnerak |
+| Website (optional) | https://mahnerak.com/|
diff --git a/.github/contributors/myavrum.md b/.github/contributors/myavrum.md
new file mode 100644
index 000000000..dc8f1bb84
--- /dev/null
+++ b/.github/contributors/myavrum.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Marat M. Yavrumyan |
+| Company name (if applicable) | YSU, UD_Armenian Project |
+| Title or role (if applicable) | Dr., Principal Investigator |
+| Date | 2020-06-19 |
+| GitHub username | myavrum |
+| Website (optional) | http://armtreebank.yerevann.com/ |
diff --git a/.github/contributors/rameshhpathak.md b/.github/contributors/rameshhpathak.md
new file mode 100644
index 000000000..30a543307
--- /dev/null
+++ b/.github/contributors/rameshhpathak.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Ramesh Pathak |
+| Company name (if applicable) | Diyo AI |
+| Title or role (if applicable) | AI Engineer |
+| Date | June 21, 2020 |
+| GitHub username | rameshhpathak |
+| Website (optional)            | rameshhpathak.github.io |
diff --git a/.github/contributors/richardliaw.md b/.github/contributors/richardliaw.md
new file mode 100644
index 000000000..2af4ce840
--- /dev/null
+++ b/.github/contributors/richardliaw.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statement below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Richard Liaw |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 06/22/2020 |
+| GitHub username | richardliaw |
+| Website (optional) | |
\ No newline at end of file
diff --git a/spacy/lang/en/__init__.py b/spacy/lang/en/__init__.py
index 4304b3c6a..d52f3dfd8 100644
--- a/spacy/lang/en/__init__.py
+++ b/spacy/lang/en/__init__.py
@@ -18,6 +18,41 @@ def _return_en(_):
return "en"
+def en_is_base_form(univ_pos, morphology=None):
+ """
+ Check whether we're dealing with an uninflected paradigm, so we can
+ avoid lemmatization entirely.
+
+ univ_pos (unicode / int): The token's universal part-of-speech tag.
+ morphology (dict): The token's morphological features following the
+ Universal Dependencies scheme.
+ """
+ if morphology is None:
+ morphology = {}
+ if univ_pos == "noun" and morphology.get("Number") == "sing":
+ return True
+ elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
+ return True
+ # This maps 'VBP' to base form -- probably just need 'IS_BASE'
+ # morphology
+ elif univ_pos == "verb" and (
+ morphology.get("VerbForm") == "fin"
+ and morphology.get("Tense") == "pres"
+ and morphology.get("Number") is None
+ ):
+ return True
+ elif univ_pos == "adj" and morphology.get("Degree") == "pos":
+ return True
+ elif morphology.get("VerbForm") == "inf":
+ return True
+ elif morphology.get("VerbForm") == "none":
+ return True
+ elif morphology.get("Degree") == "pos":
+ return True
+ else:
+ return False
+
+
class EnglishDefaults(Language.Defaults):
lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
lex_attr_getters.update(LEX_ATTRS)
@@ -26,6 +61,7 @@ class EnglishDefaults(Language.Defaults):
tag_map = TAG_MAP
stop_words = STOP_WORDS
morph_rules = MORPH_RULES
+ is_base_form = en_is_base_form
syntax_iterators = SYNTAX_ITERATORS
single_orth_variants = [
{"tags": ["NFP"], "variants": ["…", "..."]},
diff --git a/spacy/lang/hy/examples.py b/spacy/lang/hy/examples.py
index 323f77b1c..8a00fd243 100644
--- a/spacy/lang/hy/examples.py
+++ b/spacy/lang/hy/examples.py
@@ -11,6 +11,6 @@ Example sentences to test spaCy and its language models.
sentences = [
"Լոնդոնը Միացյալ Թագավորության մեծ քաղաք է։",
"Ո՞վ է Ֆրանսիայի նախագահը։",
- "Որն է Միացյալ Նահանգների մայրաքաղաքը։",
+ "Ո՞րն է Միացյալ Նահանգների մայրաքաղաքը։",
"Ե՞րբ է ծնվել Բարաք Օբաման։",
]
diff --git a/spacy/lang/hy/lex_attrs.py b/spacy/lang/hy/lex_attrs.py
index 910625fb8..dea3c0e97 100644
--- a/spacy/lang/hy/lex_attrs.py
+++ b/spacy/lang/hy/lex_attrs.py
@@ -5,8 +5,8 @@ from ...attrs import LIKE_NUM
_num_words = [
- "զրօ",
- "մէկ",
+ "զրո",
+ "մեկ",
"երկու",
"երեք",
"չորս",
@@ -18,20 +18,21 @@ _num_words = [
"տասը",
"տասնմեկ",
"տասներկու",
- "տասներեք",
- "տասնչորս",
- "տասնհինգ",
- "տասնվեց",
- "տասնյոթ",
- "տասնութ",
- "տասնինը",
- "քսան" "երեսուն",
+ "տասներեք",
+ "տասնչորս",
+ "տասնհինգ",
+ "տասնվեց",
+ "տասնյոթ",
+ "տասնութ",
+ "տասնինը",
+ "քսան",
+ "երեսուն",
"քառասուն",
"հիսուն",
- "վաթցսուն",
+ "վաթսուն",
"յոթանասուն",
"ութսուն",
- "ինիսուն",
+ "իննսուն",
"հարյուր",
"հազար",
"միլիոն",
diff --git a/spacy/lang/ja/__init__.py b/spacy/lang/ja/__init__.py
index a7ad0846e..fb8b9d7fe 100644
--- a/spacy/lang/ja/__init__.py
+++ b/spacy/lang/ja/__init__.py
@@ -20,12 +20,7 @@ from ... import util
# Hold the attributes we need with convenient names
-DetailedToken = namedtuple("DetailedToken", ["surface", "pos", "lemma"])
-
-# Handling for multiple spaces in a row is somewhat awkward, this simplifies
-# the flow by creating a dummy with the same interface.
-DummyNode = namedtuple("DummyNode", ["surface", "pos", "lemma"])
-DummySpace = DummyNode(" ", " ", " ")
+DetailedToken = namedtuple("DetailedToken", ["surface", "tag", "inf", "lemma", "reading", "sub_tokens"])
def try_sudachi_import(split_mode="A"):
@@ -53,7 +48,7 @@ def try_sudachi_import(split_mode="A"):
)
-def resolve_pos(orth, pos, next_pos):
+def resolve_pos(orth, tag, next_tag):
"""If necessary, add a field to the POS tag for UD mapping.
Under Universal Dependencies, sometimes the same Unidic POS tag can
be mapped differently depending on the literal token or its context
@@ -64,124 +59,77 @@ def resolve_pos(orth, pos, next_pos):
# Some tokens have their UD tag decided based on the POS of the following
# token.
- # orth based rules
- if pos[0] in TAG_ORTH_MAP:
- orth_map = TAG_ORTH_MAP[pos[0]]
+ # apply orth based mapping
+ if tag in TAG_ORTH_MAP:
+ orth_map = TAG_ORTH_MAP[tag]
if orth in orth_map:
- return orth_map[orth], None
+ return orth_map[orth], None # current_pos, next_pos
- # tag bi-gram mapping
- if next_pos:
- tag_bigram = pos[0], next_pos[0]
+ # apply tag bi-gram mapping
+ if next_tag:
+ tag_bigram = tag, next_tag
if tag_bigram in TAG_BIGRAM_MAP:
- bipos = TAG_BIGRAM_MAP[tag_bigram]
- if bipos[0] is None:
- return TAG_MAP[pos[0]][POS], bipos[1]
+ current_pos, next_pos = TAG_BIGRAM_MAP[tag_bigram]
+ if current_pos is None: # apply tag uni-gram mapping for current_pos
+ return TAG_MAP[tag][POS], next_pos # only next_pos is identified by tag bi-gram mapping
else:
- return bipos
+ return current_pos, next_pos
- return TAG_MAP[pos[0]][POS], None
+ # apply tag uni-gram mapping
+ return TAG_MAP[tag][POS], None
-# Use a mapping of paired punctuation to avoid splitting quoted sentences.
-pairpunct = {'「':'」', '『': '』', '【': '】'}
-
-
-def separate_sentences(doc):
- """Given a doc, mark tokens that start sentences based on Unidic tags.
- """
-
- stack = [] # save paired punctuation
-
- for i, token in enumerate(doc[:-2]):
- # Set all tokens after the first to false by default. This is necessary
- # for the doc code to be aware we've done sentencization, see
- # `is_sentenced`.
- token.sent_start = (i == 0)
- if token.tag_:
- if token.tag_ == "補助記号-括弧開":
- ts = str(token)
- if ts in pairpunct:
- stack.append(pairpunct[ts])
- elif stack and ts == stack[-1]:
- stack.pop()
-
- if token.tag_ == "補助記号-句点":
- next_token = doc[i+1]
- if next_token.tag_ != token.tag_ and not stack:
- next_token.sent_start = True
-
-
-def get_dtokens(tokenizer, text):
- tokens = tokenizer.tokenize(text)
- words = []
- for ti, token in enumerate(tokens):
- tag = '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*'])
- inf = '-'.join([xx for xx in token.part_of_speech()[4:] if xx != '*'])
- dtoken = DetailedToken(
- token.surface(),
- (tag, inf),
- token.dictionary_form())
- if ti > 0 and words[-1].pos[0] == '空白' and tag == '空白':
- # don't add multiple space tokens in a row
- continue
- words.append(dtoken)
-
- # remove empty tokens. These can be produced with characters like … that
- # Sudachi normalizes internally.
- words = [ww for ww in words if len(ww.surface) > 0]
- return words
-
-
-def get_words_lemmas_tags_spaces(dtokens, text, gap_tag=("空白", "")):
+def get_dtokens_and_spaces(dtokens, text, gap_tag="空白"):
+    # First, check that the concatenated token surfaces match the text
words = [x.surface for x in dtokens]
if "".join("".join(words).split()) != "".join(text.split()):
raise ValueError(Errors.E194.format(text=text, words=words))
- text_words = []
- text_lemmas = []
- text_tags = []
+
+ text_dtokens = []
text_spaces = []
text_pos = 0
# handle empty and whitespace-only texts
if len(words) == 0:
- return text_words, text_lemmas, text_tags, text_spaces
+ return text_dtokens, text_spaces
elif len([word for word in words if not word.isspace()]) == 0:
assert text.isspace()
- text_words = [text]
- text_lemmas = [text]
- text_tags = [gap_tag]
+ text_dtokens = [DetailedToken(text, gap_tag, '', text, None, None)]
text_spaces = [False]
- return text_words, text_lemmas, text_tags, text_spaces
- # normalize words to remove all whitespace tokens
- norm_words, norm_dtokens = zip(*[(word, dtokens) for word, dtokens in zip(words, dtokens) if not word.isspace()])
- # align words with text
- for word, dtoken in zip(norm_words, norm_dtokens):
+ return text_dtokens, text_spaces
+
+    # align words and dtokens by referring to the text, and insert gap tokens for spans of space characters
+ for word, dtoken in zip(words, dtokens):
+ # skip all space tokens
+ if word.isspace():
+ continue
try:
word_start = text[text_pos:].index(word)
except ValueError:
raise ValueError(Errors.E194.format(text=text, words=words))
+
+ # space token
if word_start > 0:
w = text[text_pos:text_pos + word_start]
- text_words.append(w)
- text_lemmas.append(w)
- text_tags.append(gap_tag)
+ text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
text_spaces.append(False)
text_pos += word_start
- text_words.append(word)
- text_lemmas.append(dtoken.lemma)
- text_tags.append(dtoken.pos)
+
+ # content word
+ text_dtokens.append(dtoken)
text_spaces.append(False)
text_pos += len(word)
+        # consume a single space character following the word, if present
if text_pos < len(text) and text[text_pos] == " ":
text_spaces[-1] = True
text_pos += 1
+
+ # trailing space token
if text_pos < len(text):
w = text[text_pos:]
- text_words.append(w)
- text_lemmas.append(w)
- text_tags.append(gap_tag)
+ text_dtokens.append(DetailedToken(w, gap_tag, '', w, None, None))
text_spaces.append(False)
- return text_words, text_lemmas, text_tags, text_spaces
+
+ return text_dtokens, text_spaces
class JapaneseTokenizer(DummyTokenizer):
@@ -191,29 +139,78 @@ class JapaneseTokenizer(DummyTokenizer):
self.tokenizer = try_sudachi_import(self.split_mode)
def __call__(self, text):
- dtokens = get_dtokens(self.tokenizer, text)
+        # convert sudachipy.morpheme.Morpheme to DetailedToken and merge consecutive spaces
+ sudachipy_tokens = self.tokenizer.tokenize(text)
+ dtokens = self._get_dtokens(sudachipy_tokens)
+ dtokens, spaces = get_dtokens_and_spaces(dtokens, text)
- words, lemmas, unidic_tags, spaces = get_words_lemmas_tags_spaces(dtokens, text)
+ # create Doc with tag bi-gram based part-of-speech identification rules
+ words, tags, inflections, lemmas, readings, sub_tokens_list = zip(*dtokens) if dtokens else [[]] * 6
+ sub_tokens_list = list(sub_tokens_list)
doc = Doc(self.vocab, words=words, spaces=spaces)
- next_pos = None
- for idx, (token, lemma, unidic_tag) in enumerate(zip(doc, lemmas, unidic_tags)):
- token.tag_ = unidic_tag[0]
- if next_pos:
+ next_pos = None # for bi-gram rules
+ for idx, (token, dtoken) in enumerate(zip(doc, dtokens)):
+ token.tag_ = dtoken.tag
+ if next_pos: # already identified in previous iteration
token.pos = next_pos
next_pos = None
else:
token.pos, next_pos = resolve_pos(
token.orth_,
- unidic_tag,
- unidic_tags[idx + 1] if idx + 1 < len(unidic_tags) else None
+ dtoken.tag,
+ tags[idx + 1] if idx + 1 < len(tags) else None
)
-
# if there's no lemma info (it's an unk) just use the surface
- token.lemma_ = lemma
- doc.user_data["unidic_tags"] = unidic_tags
+ token.lemma_ = dtoken.lemma if dtoken.lemma else dtoken.surface
+
+ doc.user_data["inflections"] = inflections
+ doc.user_data["reading_forms"] = readings
+ doc.user_data["sub_tokens"] = sub_tokens_list
return doc
+ def _get_dtokens(self, sudachipy_tokens, need_sub_tokens=True):
+ sub_tokens_list = self._get_sub_tokens(sudachipy_tokens) if need_sub_tokens else None
+ dtokens = [
+ DetailedToken(
+ token.surface(), # orth
+ '-'.join([xx for xx in token.part_of_speech()[:4] if xx != '*']), # tag
+ ','.join([xx for xx in token.part_of_speech()[4:] if xx != '*']), # inf
+ token.dictionary_form(), # lemma
+ token.reading_form(), # user_data['reading_forms']
+ sub_tokens_list[idx] if sub_tokens_list else None, # user_data['sub_tokens']
+ ) for idx, token in enumerate(sudachipy_tokens) if len(token.surface()) > 0
+            # empty tokens, produced by characters like … that Sudachi normalizes
+            # internally, are filtered out by the condition above
+        ]
+        # Sudachi outputs each space char as its own token; keep only the first of
+        # each consecutive run so get_dtokens_and_spaces() can align them with the text
+ return [
+ t for idx, t in enumerate(dtokens) if
+ idx == 0 or
+ not t.surface.isspace() or t.tag != '空白' or
+ not dtokens[idx - 1].surface.isspace() or dtokens[idx - 1].tag != '空白'
+ ]
+
+ def _get_sub_tokens(self, sudachipy_tokens):
+ if self.split_mode is None or self.split_mode == "A": # do nothing for default split mode
+ return None
+
+        sub_tokens_list = []  # per token: None, or a list of analyses (each a list of DetailedToken)
+ for token in sudachipy_tokens:
+ sub_a = token.split(self.tokenizer.SplitMode.A)
+ if len(sub_a) == 1: # no sub tokens
+ sub_tokens_list.append(None)
+ elif self.split_mode == "B":
+ sub_tokens_list.append([self._get_dtokens(sub_a, False)])
+ else: # "C"
+ sub_b = token.split(self.tokenizer.SplitMode.B)
+ if len(sub_a) == len(sub_b):
+ dtokens = self._get_dtokens(sub_a, False)
+ sub_tokens_list.append([dtokens, dtokens])
+ else:
+ sub_tokens_list.append([self._get_dtokens(sub_a, False), self._get_dtokens(sub_b, False)])
+ return sub_tokens_list
+
def _get_config(self):
config = OrderedDict(
(
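
The reworked Japanese tokenizer stores per-token inflections, reading forms and sub-token analyses in `doc.user_data`, keyed as in `__call__` above. A usage sketch, assuming SudachiPy and its dictionary are installed (the concrete values depend on the installed dictionary and are spelled out in the tests later in this diff):

```python
from spacy.lang.ja import Japanese

# Split mode "C" yields the coarsest tokens; "sub_tokens" then carries the
# finer-grained A/B analyses per token (or None where there is no split).
nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
doc = nlp("選挙管理委員会")

print([t.text for t in doc])            # surface forms
print(doc.user_data["reading_forms"])   # katakana readings, one per token
print(doc.user_data["inflections"])     # inflection strings, one per token
print(doc.user_data["sub_tokens"])      # sub-token analyses (lists of DetailedToken)
```
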
diff --git a/spacy/lang/ja/bunsetu.py b/spacy/lang/ja/bunsetu.py
deleted file mode 100644
index 7c3eee336..000000000
--- a/spacy/lang/ja/bunsetu.py
+++ /dev/null
@@ -1,144 +0,0 @@
-# coding: utf8
-from __future__ import unicode_literals
-
-from .stop_words import STOP_WORDS
-
-
-POS_PHRASE_MAP = {
- "NOUN": "NP",
- "NUM": "NP",
- "PRON": "NP",
- "PROPN": "NP",
-
- "VERB": "VP",
-
- "ADJ": "ADJP",
-
- "ADV": "ADVP",
-
- "CCONJ": "CCONJP",
-}
-
-
-# return value: [(bunsetu_tokens, phrase_type={'NP', 'VP', 'ADJP', 'ADVP'}, phrase_tokens)]
-def yield_bunsetu(doc, debug=False):
- bunsetu = []
- bunsetu_may_end = False
- phrase_type = None
- phrase = None
- prev = None
- prev_tag = None
- prev_dep = None
- prev_head = None
- for t in doc:
- pos = t.pos_
- pos_type = POS_PHRASE_MAP.get(pos, None)
- tag = t.tag_
- dep = t.dep_
- head = t.head.i
- if debug:
- print(t.i, t.orth_, pos, pos_type, dep, head, bunsetu_may_end, phrase_type, phrase, bunsetu)
-
- # DET is always an individual bunsetu
- if pos == "DET":
- if bunsetu:
- yield bunsetu, phrase_type, phrase
- yield [t], None, None
- bunsetu = []
- bunsetu_may_end = False
- phrase_type = None
- phrase = None
-
- # PRON or Open PUNCT always splits bunsetu
- elif tag == "補助記号-括弧開":
- if bunsetu:
- yield bunsetu, phrase_type, phrase
- bunsetu = [t]
- bunsetu_may_end = True
- phrase_type = None
- phrase = None
-
- # bunsetu head not appeared
- elif phrase_type is None:
- if bunsetu and prev_tag == "補助記号-読点":
- yield bunsetu, phrase_type, phrase
- bunsetu = []
- bunsetu_may_end = False
- phrase_type = None
- phrase = None
- bunsetu.append(t)
- if pos_type: # begin phrase
- phrase = [t]
- phrase_type = pos_type
- if pos_type in {"ADVP", "CCONJP"}:
- bunsetu_may_end = True
-
- # entering new bunsetu
- elif pos_type and (
- pos_type != phrase_type or # different phrase type arises
- bunsetu_may_end # same phrase type but bunsetu already ended
- ):
- # exceptional case: NOUN to VERB
- if phrase_type == "NP" and pos_type == "VP" and prev_dep == 'compound' and prev_head == t.i:
- bunsetu.append(t)
- phrase_type = "VP"
- phrase.append(t)
- # exceptional case: VERB to NOUN
- elif phrase_type == "VP" and pos_type == "NP" and (
- prev_dep == 'compound' and prev_head == t.i or
- dep == 'compound' and prev == head or
- prev_dep == 'nmod' and prev_head == t.i
- ):
- bunsetu.append(t)
- phrase_type = "NP"
- phrase.append(t)
- else:
- yield bunsetu, phrase_type, phrase
- bunsetu = [t]
- bunsetu_may_end = False
- phrase_type = pos_type
- phrase = [t]
-
- # NOUN bunsetu
- elif phrase_type == "NP":
- bunsetu.append(t)
- if not bunsetu_may_end and ((
- (pos_type == "NP" or pos == "SYM") and (prev_head == t.i or prev_head == head) and prev_dep in {'compound', 'nummod'}
- ) or (
- pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
- )):
- phrase.append(t)
- else:
- bunsetu_may_end = True
-
- # VERB bunsetu
- elif phrase_type == "VP":
- bunsetu.append(t)
- if not bunsetu_may_end and pos == "VERB" and prev_head == t.i and prev_dep == 'compound':
- phrase.append(t)
- else:
- bunsetu_may_end = True
-
- # ADJ bunsetu
- elif phrase_type == "ADJP" and tag != '連体詞':
- bunsetu.append(t)
- if not bunsetu_may_end and ((
- pos == "NOUN" and (prev_head == t.i or prev_head == head) and prev_dep in {'amod', 'compound'}
- ) or (
- pos == "PART" and (prev == head or prev_head == head) and dep == 'mark'
- )):
- phrase.append(t)
- else:
- bunsetu_may_end = True
-
- # other bunsetu
- else:
- bunsetu.append(t)
-
- prev = t.i
- prev_tag = t.tag_
- prev_dep = t.dep_
- prev_head = head
-
- if bunsetu:
- yield bunsetu, phrase_type, phrase
diff --git a/spacy/lang/ne/__init__.py b/spacy/lang/ne/__init__.py
new file mode 100644
index 000000000..21556277d
--- /dev/null
+++ b/spacy/lang/ne/__init__.py
@@ -0,0 +1,23 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from .stop_words import STOP_WORDS
+from .lex_attrs import LEX_ATTRS
+
+from ...language import Language
+from ...attrs import LANG
+
+
+class NepaliDefaults(Language.Defaults):
+ lex_attr_getters = dict(Language.Defaults.lex_attr_getters)
+ lex_attr_getters.update(LEX_ATTRS)
+ lex_attr_getters[LANG] = lambda text: "ne" # Nepali language ISO code
+ stop_words = STOP_WORDS
+
+
+class Nepali(Language):
+ lang = "ne"
+ Defaults = NepaliDefaults
+
+
+__all__ = ["Nepali"]
diff --git a/spacy/lang/ne/examples.py b/spacy/lang/ne/examples.py
new file mode 100644
index 000000000..b3c4f9e73
--- /dev/null
+++ b/spacy/lang/ne/examples.py
@@ -0,0 +1,22 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.ne.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+ "एप्पलले अमेरिकी स्टार्टअप १ अर्ब डलरमा किन्ने सोच्दै छ",
+ "स्वायत्त कारहरूले बीमा दायित्व निर्माताहरु तिर बदल्छन्",
+ "स्यान फ्रांसिस्कोले फुटपाथ वितरण रोबोटहरु प्रतिबंध गर्ने विचार गर्दै छ",
+ "लन्डन यूनाइटेड किंगडमको एक ठूलो शहर हो।",
+ "तिमी कहाँ छौ?",
+ "फ्रान्स को राष्ट्रपति को हो?",
+ "संयुक्त राज्यको राजधानी के हो?",
+ "बराक ओबामा कहिले कहिले जन्मेका हुन्?",
+]
diff --git a/spacy/lang/ne/lex_attrs.py b/spacy/lang/ne/lex_attrs.py
new file mode 100644
index 000000000..652307577
--- /dev/null
+++ b/spacy/lang/ne/lex_attrs.py
@@ -0,0 +1,98 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+from ..norm_exceptions import BASE_NORMS
+from ...attrs import NORM, LIKE_NUM
+
+
+# fmt: off
+_stem_suffixes = [
+ ["ा", "ि", "ी", "ु", "ू", "ृ", "े", "ै", "ो", "ौ"],
+ ["ँ", "ं", "्", "ः"],
+ ["लाई", "ले", "बाट", "को", "मा", "हरू"],
+ ["हरूलाई", "हरूले", "हरूबाट", "हरूको", "हरूमा"],
+ ["इलो", "िलो", "नु", "ाउनु", "ई", "इन", "इन्", "इनन्"],
+ ["एँ", "इँन्", "इस्", "इनस्", "यो", "एन", "यौं", "एनौं", "ए", "एनन्"],
+ ["छु", "छौँ", "छस्", "छौ", "छ", "छन्", "छेस्", "छे", "छ्यौ", "छिन्", "हुन्छ"],
+ ["दै", "दिन", "दिँन", "दैनस्", "दैन", "दैनौँ", "दैनौं", "दैनन्"],
+ ["हुन्न", "न्न", "न्न्स्", "न्नौं", "न्नौ", "न्न्न्", "िई"],
+ ["अ", "ओ", "ऊ", "अरी", "साथ", "वित्तिकै", "पूर्वक"],
+ ["याइ", "ाइ", "बार", "वार", "चाँहि"],
+ ["ने", "ेको", "ेकी", "ेका", "ेर", "दै", "तै", "िकन", "उ", "न", "नन्"]
+]
+# fmt: on
+
+# reference 1: https://en.wikipedia.org/wiki/Numbers_in_Nepali_language
+# reference 2: https://www.imnepal.com/nepali-numbers/
+_num_words = [
+ "शुन्य",
+ "एक",
+ "दुई",
+ "तीन",
+ "चार",
+ "पाँच",
+ "छ",
+ "सात",
+ "आठ",
+ "नौ",
+ "दश",
+ "एघार",
+ "बाह्र",
+ "तेह्र",
+ "चौध",
+ "पन्ध्र",
+ "सोह्र",
+ "सोह्र",
+ "सत्र",
+ "अठार",
+ "उन्नाइस",
+ "बीस",
+ "तीस",
+ "चालीस",
+ "पचास",
+ "साठी",
+ "सत्तरी",
+ "असी",
+ "नब्बे",
+ "सय",
+ "हजार",
+ "लाख",
+ "करोड",
+ "अर्ब",
+ "खर्ब",
+]
+
+
+def norm(string):
+ # normalise base exceptions, e.g. punctuation or currency symbols
+ if string in BASE_NORMS:
+ return BASE_NORMS[string]
+ # set stem word as norm, if available, adapted from:
+ # https://github.com/explosion/spaCy/blob/master/spacy/lang/hi/lex_attrs.py
+ # https://www.researchgate.net/publication/237261579_Structure_of_Nepali_Grammar
+ for suffix_group in reversed(_stem_suffixes):
+ length = len(suffix_group[0])
+ if len(string) <= length:
+ break
+ for suffix in suffix_group:
+ if string.endswith(suffix):
+ return string[:-length]
+ return string
+
+
+def like_num(text):
+ if text.startswith(("+", "-", "±", "~")):
+ text = text[1:]
+    text = text.replace(",", "").replace(".", "")
+ if text.isdigit():
+ return True
+ if text.count("/") == 1:
+ num, denom = text.split("/")
+ if num.isdigit() and denom.isdigit():
+ return True
+ if text.lower() in _num_words:
+ return True
+ return False
+
+
+LEX_ATTRS = {NORM: norm, LIKE_NUM: like_num}
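
`norm` and `like_num` can also be exercised directly; a brief sketch of the intended behaviour (the suffix stripping in `norm` is heuristic, so no exact output is asserted for it):

```python
from spacy.lang.ne.lex_attrs import like_num, norm

assert like_num("तीन")           # listed in _num_words
assert like_num("१२३")           # Devanagari digits count as digits for isdigit()
assert like_num("+42") and like_num("3/4")
assert not like_num("काठमाडौं")

print(norm("घरहरूमा"))           # heuristic suffix stripping
```
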
diff --git a/spacy/lang/ne/stop_words.py b/spacy/lang/ne/stop_words.py
new file mode 100644
index 000000000..f008697d0
--- /dev/null
+++ b/spacy/lang/ne/stop_words.py
@@ -0,0 +1,498 @@
+# coding: utf8
+from __future__ import unicode_literals
+
+
+# Source: https://github.com/sanjaalcorps/NepaliStopWords/blob/master/NepaliStopWords.txt
+
+STOP_WORDS = set(
+ """
+अक्सर
+अगाडि
+अगाडी
+अघि
+अझै
+अठार
+अथवा
+अनि
+अनुसार
+अन्तर्गत
+अन्य
+अन्यत्र
+अन्यथा
+अब
+अरु
+अरुलाई
+अरू
+अर्को
+अर्थात
+अर्थात्
+अलग
+अलि
+अवस्था
+अहिले
+आए
+आएका
+आएको
+आज
+आजको
+आठ
+आत्म
+आदि
+आदिलाई
+आफनो
+आफू
+आफूलाई
+आफै
+आफैँ
+आफ्नै
+आफ्नो
+आयो
+उ
+उक्त
+उदाहरण
+उनको
+उनलाई
+उनले
+उनि
+उनी
+उनीहरुको
+उन्नाइस
+उप
+उसको
+उसलाई
+उसले
+उहालाई
+ऊ
+एउटा
+एउटै
+एक
+एकदम
+एघार
+ओठ
+औ
+औं
+कता
+कति
+कतै
+कम
+कमसेकम
+कसरि
+कसरी
+कसै
+कसैको
+कसैलाई
+कसैले
+कसैसँग
+कस्तो
+कहाँबाट
+कहिलेकाहीं
+का
+काम
+कारण
+कि
+किन
+किनभने
+कुन
+कुनै
+कुन्नी
+कुरा
+कृपया
+के
+केहि
+केही
+को
+कोहि
+कोहिपनि
+कोही
+कोहीपनि
+क्रमशः
+गए
+गएको
+गएर
+गयौ
+गरि
+गरी
+गरे
+गरेका
+गरेको
+गरेर
+गरौं
+गर्छ
+गर्छन्
+गर्छु
+गर्दा
+गर्दै
+गर्न
+गर्नु
+गर्नुपर्छ
+गर्ने
+गैर
+घर
+चार
+चाले
+चाहनुहुन्छ
+चाहन्छु
+चाहिं
+चाहिए
+चाहिंले
+चाहीं
+चाहेको
+चाहेर
+चोटी
+चौथो
+चौध
+छ
+छन
+छन्
+छु
+छू
+छैन
+छैनन्
+छौ
+छौं
+जता
+जताततै
+जना
+जनाको
+जनालाई
+जनाले
+जब
+जबकि
+जबकी
+जसको
+जसबाट
+जसमा
+जसरी
+जसलाई
+जसले
+जस्ता
+जस्तै
+जस्तो
+जस्तोसुकै
+जहाँ
+जान
+जाने
+जाहिर
+जुन
+जुनै
+जे
+जो
+जोपनि
+जोपनी
+झैं
+ठाउँमा
+ठीक
+ठूलो
+त
+तता
+तत्काल
+तथा
+तथापि
+तथापी
+तदनुसार
+तपाइ
+तपाई
+तपाईको
+तब
+तर
+तर्फ
+तल
+तसरी
+तापनि
+तापनी
+तिन
+तिनि
+तिनिहरुलाई
+तिनी
+तिनीहरु
+तिनीहरुको
+तिनीहरू
+तिनीहरूको
+तिनै
+तिमी
+तिर
+तिरको
+ती
+तीन
+तुरन्त
+तुरुन्त
+तुरुन्तै
+तेश्रो
+तेस्कारण
+तेस्रो
+तेह्र
+तैपनि
+तैपनी
+त्यत्तिकै
+त्यत्तिकैमा
+त्यस
+त्यसकारण
+त्यसको
+त्यसले
+त्यसैले
+त्यसो
+त्यस्तै
+त्यस्तो
+त्यहाँ
+त्यहिँ
+त्यही
+त्यहीँ
+त्यहीं
+त्यो
+त्सपछि
+त्सैले
+थप
+थरि
+थरी
+थाहा
+थिए
+थिएँ
+थिएन
+थियो
+दर्ता
+दश
+दिए
+दिएको
+दिन
+दिनुभएको
+दिनुहुन्छ
+दुइ
+दुइवटा
+दुई
+देखि
+देखिन्छ
+देखियो
+देखे
+देखेको
+देखेर
+दोश्री
+दोश्रो
+दोस्रो
+द्वारा
+धन्न
+धेरै
+धौ
+न
+नगर्नु
+नगर्नू
+नजिकै
+नत्र
+नत्रभने
+नभई
+नभएको
+नभनेर
+नयाँ
+नि
+निकै
+निम्ति
+निम्न
+निम्नानुसार
+निर्दिष्ट
+नै
+नौ
+पक्का
+पक्कै
+पछाडि
+पछाडी
+पछि
+पछिल्लो
+पछी
+पटक
+पनि
+पन्ध्र
+पर्छ
+पर्थ्यो
+पर्दैन
+पर्ने
+पर्नेमा
+पर्याप्त
+पहिले
+पहिलो
+पहिल्यै
+पाँच
+पांच
+पाचौँ
+पाँचौं
+पिच्छे
+पूर्व
+पो
+प्रति
+प्रतेक
+प्रत्यक
+प्राय
+प्लस
+फरक
+फेरि
+फेरी
+बढी
+बताए
+बने
+बरु
+बाट
+बारे
+बाहिर
+बाहेक
+बाह्र
+बिच
+बिचमा
+बिरुद्ध
+बिशेष
+बिस
+बीच
+बीचमा
+बीस
+भए
+भएँ
+भएका
+भएकालाई
+भएको
+भएन
+भएर
+भन
+भने
+भनेको
+भनेर
+भन्
+भन्छन्
+भन्छु
+भन्दा
+भन्दै
+भन्नुभयो
+भन्ने
+भन्या
+भयेन
+भयो
+भर
+भरि
+भरी
+भा
+भित्र
+भित्री
+भीत्र
+म
+मध्य
+मध्ये
+मलाई
+मा
+मात्र
+मात्रै
+माथि
+माथी
+मुख्य
+मुनि
+मुन्तिर
+मेरो
+मैले
+यति
+यथोचित
+यदि
+यद्ध्यपि
+यद्यपि
+यस
+यसका
+यसको
+यसपछि
+यसबाहेक
+यसमा
+यसरी
+यसले
+यसो
+यस्तै
+यस्तो
+यहाँ
+यहाँसम्म
+यही
+या
+यी
+यो
+र
+रही
+रहेका
+रहेको
+रहेछ
+राखे
+राख्छ
+राम्रो
+रुपमा
+रूप
+रे
+लगभग
+लगायत
+लाई
+लाख
+लागि
+लागेको
+ले
+वटा
+वरीपरी
+वा
+वाट
+वापत
+वास्तवमा
+शायद
+सक्छ
+सक्ने
+सँग
+संग
+सँगको
+सँगसँगै
+सँगै
+संगै
+सङ्ग
+सङ्गको
+सट्टा
+सत्र
+सधै
+सबै
+सबैको
+सबैलाई
+समय
+समेत
+सम्भव
+सम्म
+सय
+सरह
+सहित
+सहितै
+सही
+साँच्चै
+सात
+साथ
+साथै
+सायद
+सारा
+सुनेको
+सुनेर
+सुरु
+सुरुको
+सुरुमै
+सो
+सोचेको
+सोचेर
+सोही
+सोह्र
+स्थित
+स्पष्ट
+हजार
+हरे
+हरेक
+हामी
+हामीले
+हाम्रा
+हाम्रो
+हुँदैन
+हुन
+हुनत
+हुनु
+हुने
+हुनेछ
+हुन्
+हुन्छ
+हुन्थ्यो
+हैन
+हो
+होइन
+होकि
+होला
+""".split()
+)
diff --git a/spacy/language.py b/spacy/language.py
index 2058def8a..faa0447a4 100644
--- a/spacy/language.py
+++ b/spacy/language.py
@@ -46,7 +46,7 @@ class BaseDefaults(object):
def create_lemmatizer(cls, nlp=None, lookups=None):
if lookups is None:
lookups = cls.create_lookups(nlp=nlp)
- return Lemmatizer(lookups=lookups)
+ return Lemmatizer(lookups=lookups, is_base_form=cls.is_base_form)
@classmethod
def create_lookups(cls, nlp=None):
@@ -120,6 +120,7 @@ class BaseDefaults(object):
tokenizer_exceptions = {}
stop_words = set()
morph_rules = {}
+ is_base_form = None
lex_attr_getters = LEX_ATTRS
syntax_iterators = {}
resources = {}
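
`is_base_form` defaults to `None` on `BaseDefaults` and is forwarded by `create_lemmatizer`, so a language opts in simply by assigning a callable on its `Defaults`, as English does earlier in this diff. A minimal sketch with a hypothetical custom language (the class names here are illustrative, not part of the diff):

```python
from spacy.language import Language
from spacy.lookups import Lookups
from spacy.lang.en import en_is_base_form  # reused here purely for illustration


class CustomDefaults(Language.Defaults):
    # any callable taking (univ_pos, morphology) can be plugged in here
    is_base_form = en_is_base_form


class Custom(Language):
    lang = "xx"  # hypothetical: reuses spaCy's multi-language code for the sketch
    Defaults = CustomDefaults


# create_lemmatizer() forwards Defaults.is_base_form into the Lemmatizer
lemmatizer = Custom.Defaults.create_lemmatizer(lookups=Lookups())
assert lemmatizer.is_base_form is en_is_base_form
```
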
diff --git a/spacy/lemmatizer.py b/spacy/lemmatizer.py
index 1f0f0da3f..f72eae128 100644
--- a/spacy/lemmatizer.py
+++ b/spacy/lemmatizer.py
@@ -21,7 +21,7 @@ class Lemmatizer(object):
def load(cls, *args, **kwargs):
raise NotImplementedError(Errors.E172)
- def __init__(self, lookups, *args, **kwargs):
+ def __init__(self, lookups, *args, is_base_form=None, **kwargs):
"""Initialize a Lemmatizer.
lookups (Lookups): The lookups object containing the (optional) tables
@@ -31,6 +31,7 @@ class Lemmatizer(object):
if args or kwargs or not isinstance(lookups, Lookups):
raise ValueError(Errors.E173)
self.lookups = lookups
+ self.is_base_form = is_base_form
def __call__(self, string, univ_pos, morphology=None):
"""Lemmatize a string.
@@ -51,7 +52,7 @@ class Lemmatizer(object):
if univ_pos in ("", "eol", "space"):
return [string.lower()]
# See Issue #435 for example of where this logic is requied.
- if self.is_base_form(univ_pos, morphology):
+ if callable(self.is_base_form) and self.is_base_form(univ_pos, morphology):
return [string.lower()]
index_table = self.lookups.get_table("lemma_index", {})
exc_table = self.lookups.get_table("lemma_exc", {})
@@ -69,40 +70,6 @@ class Lemmatizer(object):
)
return lemmas
- def is_base_form(self, univ_pos, morphology=None):
- """
- Check whether we're dealing with an uninflected paradigm, so we can
- avoid lemmatization entirely.
-
- univ_pos (unicode / int): The token's universal part-of-speech tag.
- morphology (dict): The token's morphological features following the
- Universal Dependencies scheme.
- """
- if morphology is None:
- morphology = {}
- if univ_pos == "noun" and morphology.get("Number") == "sing":
- return True
- elif univ_pos == "verb" and morphology.get("VerbForm") == "inf":
- return True
- # This maps 'VBP' to base form -- probably just need 'IS_BASE'
- # morphology
- elif univ_pos == "verb" and (
- morphology.get("VerbForm") == "fin"
- and morphology.get("Tense") == "pres"
- and morphology.get("Number") is None
- ):
- return True
- elif univ_pos == "adj" and morphology.get("Degree") == "pos":
- return True
- elif morphology.get("VerbForm") == "inf":
- return True
- elif morphology.get("VerbForm") == "none":
- return True
- elif morphology.get("Degree") == "pos":
- return True
- else:
- return False
-
def noun(self, string, morphology=None):
return self(string, "noun", morphology)
diff --git a/spacy/lexeme.pyx b/spacy/lexeme.pyx
index 1df516dcb..8042098d7 100644
--- a/spacy/lexeme.pyx
+++ b/spacy/lexeme.pyx
@@ -349,7 +349,7 @@ cdef class Lexeme:
@property
def is_oov(self):
"""RETURNS (bool): Whether the lexeme is out-of-vocabulary."""
- return self.orth in self.vocab.vectors
+ return self.orth not in self.vocab.vectors
property is_stop:
"""RETURNS (bool): Whether the lexeme is a stop word."""
diff --git a/spacy/lookups.py b/spacy/lookups.py
index 1fa29bdfe..d4947be9f 100644
--- a/spacy/lookups.py
+++ b/spacy/lookups.py
@@ -120,8 +120,7 @@ class Lookups(object):
"""
self._tables = OrderedDict()
for key, value in srsly.msgpack_loads(bytes_data).items():
- self._tables[key] = Table(key)
- self._tables[key].update(value)
+ self._tables[key] = Table(key, value)
return self
def to_disk(self, path, filename="lookups.bin", **kwargs):
@@ -192,7 +191,7 @@ class Table(OrderedDict):
self.name = name
# Assume a default size of 1M items
self.default_size = 1e6
- size = len(data) if data and len(data) > 0 else self.default_size
+ size = max(len(data), 1) if data is not None else self.default_size
self.bloom = BloomFilter.from_error_rate(size)
if data:
self.update(data)
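
`from_bytes` now rebuilds each table through the `Table` constructor, and `Table.__init__` sizes its Bloom filter as `max(len(data), 1)`, so restored tables are sized from their actual data and empty tables are handled cleanly. A short round-trip sketch:

```python
from spacy.lookups import Lookups

lookups = Lookups()
lookups.add_table("lemma_lookup", {"dogs": "dog", "ran": "run"})
lookups.add_table("empty_table", {})  # now gets a minimal Bloom filter, not the 1M default

restored = Lookups().from_bytes(lookups.to_bytes())
assert restored.has_table("empty_table")
assert restored.get_table("lemma_lookup")["dogs"] == "dog"
```
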
diff --git a/spacy/pipeline/pipes.pyx b/spacy/pipeline/pipes.pyx
index 3f40cb545..8f07bf8f7 100644
--- a/spacy/pipeline/pipes.pyx
+++ b/spacy/pipeline/pipes.pyx
@@ -528,10 +528,10 @@ class Tagger(Pipe):
new_tag_map[tag] = orig_tag_map[tag]
else:
new_tag_map[tag] = {POS: X}
- if "_SP" in orig_tag_map:
- new_tag_map["_SP"] = orig_tag_map["_SP"]
cdef Vocab vocab = self.vocab
if new_tag_map:
+ if "_SP" in orig_tag_map:
+ new_tag_map["_SP"] = orig_tag_map["_SP"]
vocab.morphology = Morphology(vocab.strings, new_tag_map,
vocab.morphology.lemmatizer,
exc=vocab.morphology.exc)
diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py
index 1f13da5d6..91b7e4d9d 100644
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -170,6 +170,11 @@ def nb_tokenizer():
return get_lang_class("nb").Defaults.create_tokenizer()
+@pytest.fixture(scope="session")
+def ne_tokenizer():
+ return get_lang_class("ne").Defaults.create_tokenizer()
+
+
@pytest.fixture(scope="session")
def nl_tokenizer():
return get_lang_class("nl").Defaults.create_tokenizer()
diff --git a/spacy/tests/lang/ja/test_tokenizer.py b/spacy/tests/lang/ja/test_tokenizer.py
index 26be5cf59..651e906eb 100644
--- a/spacy/tests/lang/ja/test_tokenizer.py
+++ b/spacy/tests/lang/ja/test_tokenizer.py
@@ -4,7 +4,7 @@ from __future__ import unicode_literals
import pytest
from ...tokenizer.test_naughty_strings import NAUGHTY_STRINGS
-from spacy.lang.ja import Japanese
+from spacy.lang.ja import Japanese, DetailedToken
# fmt: off
TOKENIZER_TESTS = [
@@ -96,6 +96,57 @@ def test_ja_tokenizer_split_modes(ja_tokenizer, text, len_a, len_b, len_c):
assert len(nlp_c(text)) == len_c
+@pytest.mark.parametrize("text,sub_tokens_list_a,sub_tokens_list_b,sub_tokens_list_c",
+ [
+ (
+ "選挙管理委員会",
+ [None, None, None, None],
+ [None, None, [
+ [
+ DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+ DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+ ]
+ ]],
+ [[
+ [
+ DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+ DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+ DetailedToken(surface='委員', tag='名詞-普通名詞-一般', inf='', lemma='委員', reading='イイン', sub_tokens=None),
+ DetailedToken(surface='会', tag='名詞-普通名詞-一般', inf='', lemma='会', reading='カイ', sub_tokens=None),
+ ], [
+ DetailedToken(surface='選挙', tag='名詞-普通名詞-サ変可能', inf='', lemma='選挙', reading='センキョ', sub_tokens=None),
+ DetailedToken(surface='管理', tag='名詞-普通名詞-サ変可能', inf='', lemma='管理', reading='カンリ', sub_tokens=None),
+ DetailedToken(surface='委員会', tag='名詞-普通名詞-一般', inf='', lemma='委員会', reading='イインカイ', sub_tokens=None),
+ ]
+ ]]
+ ),
+ ]
+)
+def test_ja_tokenizer_sub_tokens(ja_tokenizer, text, sub_tokens_list_a, sub_tokens_list_b, sub_tokens_list_c):
+ nlp_a = Japanese(meta={"tokenizer": {"config": {"split_mode": "A"}}})
+ nlp_b = Japanese(meta={"tokenizer": {"config": {"split_mode": "B"}}})
+ nlp_c = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
+
+ assert ja_tokenizer(text).user_data["sub_tokens"] == sub_tokens_list_a
+ assert nlp_a(text).user_data["sub_tokens"] == sub_tokens_list_a
+ assert nlp_b(text).user_data["sub_tokens"] == sub_tokens_list_b
+ assert nlp_c(text).user_data["sub_tokens"] == sub_tokens_list_c
+
+
+@pytest.mark.parametrize("text,inflections,reading_forms",
+ [
+ (
+ "取ってつけた",
+ ("五段-ラ行,連用形-促音便", "", "下一段-カ行,連用形-一般", "助動詞-タ,終止形-一般"),
+ ("トッ", "テ", "ツケ", "タ"),
+ ),
+ ]
+)
+def test_ja_tokenizer_inflections_reading_forms(ja_tokenizer, text, inflections, reading_forms):
+ assert ja_tokenizer(text).user_data["inflections"] == inflections
+ assert ja_tokenizer(text).user_data["reading_forms"] == reading_forms
+
+
def test_ja_tokenizer_emptyish_texts(ja_tokenizer):
doc = ja_tokenizer("")
assert len(doc) == 0
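The new tests above exercise the extra token information that the Japanese tokenizer stores in `Doc.user_data`. A hedged usage sketch (assumes SudachiPy and its dictionary are installed, e.g. via `pip install spacy[ja]`):

```python
from spacy.lang.ja import Japanese

# Split mode "C" yields the coarsest segmentation; analyses for the finer
# modes are exposed through Doc.user_data
nlp = Japanese(meta={"tokenizer": {"config": {"split_mode": "C"}}})
doc = nlp("選挙管理委員会")
print(doc.user_data["sub_tokens"])     # per-token lists of DetailedToken tuples
print(doc.user_data["reading_forms"])  # katakana readings, one per token
print(doc.user_data["inflections"])    # inflection information, one per token
```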
diff --git a/spacy/tests/lang/ne/__init__.py b/spacy/tests/lang/ne/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/spacy/tests/lang/ne/test_text.py b/spacy/tests/lang/ne/test_text.py
new file mode 100644
index 000000000..926a7de04
--- /dev/null
+++ b/spacy/tests/lang/ne/test_text.py
@@ -0,0 +1,19 @@
+# coding: utf-8
+from __future__ import unicode_literals
+
+import pytest
+
+
+def test_ne_tokenizer_handles_long_text(ne_tokenizer):
+ text = """मैले पाएको सर्टिफिकेटलाई म त बोक्रो सम्झन्छु र अभ्यास तब सुरु भयो, जब मैले कलेज पार गरेँ र जीवनको पढाइ सुरु गरेँ ।"""
+ tokens = ne_tokenizer(text)
+ assert len(tokens) == 24
+
+
+@pytest.mark.parametrize(
+ "text,length",
+ [("समय जान कति पनि बेर लाग्दैन ।", 7), ("म ठूलो हुँदै थिएँ ।", 5)],
+)
+def test_ne_tokenizer_handles_cnts(ne_tokenizer, text, length):
+ tokens = ne_tokenizer(text)
+ assert len(tokens) == length
\ No newline at end of file
diff --git a/spacy/tests/pipeline/test_tagger.py b/spacy/tests/pipeline/test_tagger.py
index a5bda9090..1681ffeaa 100644
--- a/spacy/tests/pipeline/test_tagger.py
+++ b/spacy/tests/pipeline/test_tagger.py
@@ -3,6 +3,7 @@ from __future__ import unicode_literals
import pytest
from spacy.language import Language
+from spacy.symbols import POS, NOUN
def test_label_types():
@@ -11,3 +12,16 @@ def test_label_types():
nlp.get_pipe("tagger").add_label("A")
with pytest.raises(ValueError):
nlp.get_pipe("tagger").add_label(9)
+
+
+def test_tagger_begin_training_tag_map():
+ """Test that Tagger.begin_training() without gold tuples does not clobber
+ the tag map."""
+ nlp = Language()
+ tagger = nlp.create_pipe("tagger")
+ orig_tag_count = len(tagger.labels)
+ tagger.add_label("A", {"POS": "NOUN"})
+ nlp.add_pipe(tagger)
+ nlp.begin_training()
+ assert nlp.vocab.morphology.tag_map["A"] == {POS: NOUN}
+ assert orig_tag_count + 1 == len(nlp.get_pipe("tagger").labels)
diff --git a/spacy/tests/regression/test_issue1-1000.py b/spacy/tests/regression/test_issue1-1000.py
index 6d88d68c2..38a99371e 100644
--- a/spacy/tests/regression/test_issue1-1000.py
+++ b/spacy/tests/regression/test_issue1-1000.py
@@ -11,6 +11,7 @@ from spacy.language import Language
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
from spacy.tokens import Doc, Span
+from spacy.lang.en import EnglishDefaults
from ..util import get_doc, make_tempdir
@@ -172,7 +173,7 @@ def test_issue595():
lookups.add_table("lemma_rules", {"verb": [["ed", "e"]]})
lookups.add_table("lemma_index", {"verb": {}})
lookups.add_table("lemma_exc", {"verb": {}})
- lemmatizer = Lemmatizer(lookups)
+ lemmatizer = Lemmatizer(lookups, is_base_form=EnglishDefaults.is_base_form)
vocab = Vocab(lemmatizer=lemmatizer, tag_map=tag_map)
doc = Doc(vocab, words=words)
doc[2].tag_ = "VB"
diff --git a/spacy/tests/test_lemmatizer.py b/spacy/tests/test_lemmatizer.py
index fce3772c4..e7736b042 100644
--- a/spacy/tests/test_lemmatizer.py
+++ b/spacy/tests/test_lemmatizer.py
@@ -5,6 +5,7 @@ import pytest
from spacy.tokens import Doc
from spacy.language import Language
from spacy.lookups import Lookups
+from spacy.lemmatizer import Lemmatizer
def test_lemmatizer_reflects_lookups_changes():
@@ -47,3 +48,14 @@ def test_tagger_warns_no_lookups():
with pytest.warns(None) as record:
nlp.begin_training()
assert not record.list
+
+
+def test_lemmatizer_without_is_base_form_implementation():
+ # Norwegian example from #5658
+ lookups = Lookups()
+ lookups.add_table("lemma_rules", {"noun": []})
+ lookups.add_table("lemma_index", {"noun": {}})
+ lookups.add_table("lemma_exc", {"noun": {"formuesskatten": ["formuesskatt"]}})
+
+ lemmatizer = Lemmatizer(lookups, is_base_form=None)
+ assert lemmatizer("Formuesskatten", "noun", {'Definite': 'def', 'Gender': 'masc', 'Number': 'sing'}) == ["formuesskatt"]
diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py
index 576ca93d2..b31cef1f2 100644
--- a/spacy/tests/vocab_vectors/test_vectors.py
+++ b/spacy/tests/vocab_vectors/test_vectors.py
@@ -376,6 +376,6 @@ def test_vector_is_oov():
data[1] = 2.0
vocab.set_vector("cat", data[0])
vocab.set_vector("dog", data[1])
- assert vocab["cat"].is_oov is True
- assert vocab["dog"].is_oov is True
- assert vocab["hamster"].is_oov is False
+ assert vocab["cat"].is_oov is False
+ assert vocab["dog"].is_oov is False
+ assert vocab["hamster"].is_oov is True
diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx
index 45deebc93..8d3406bae 100644
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@@ -923,7 +923,7 @@ cdef class Token:
@property
def is_oov(self):
"""RETURNS (bool): Whether the token is out-of-vocabulary."""
- return self.c.lex.orth in self.vocab.vectors
+ return self.c.lex.orth not in self.vocab.vectors
@property
def is_stop(self):
diff --git a/spacy/util.py b/spacy/util.py
index 5362952e2..923f56b31 100644
--- a/spacy/util.py
+++ b/spacy/util.py
@@ -208,6 +208,10 @@ def load_model_from_path(model_path, meta=False, **overrides):
pipeline = nlp.Defaults.pipe_names
elif pipeline in (False, None):
pipeline = []
+ # skip "vocab" from overrides in component initialization since vocab is
+ # already configured from overrides when nlp is initialized above
+ if "vocab" in overrides:
+ del overrides["vocab"]
for name in pipeline:
if name not in disable:
config = meta.get("pipeline_args", {}).get(name, {})
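A hedged sketch of the scenario this guards: passing an existing `vocab` as an override to `spacy.load()`, which should configure the `Language` object but not be forwarded into each pipeline component's config (the model package names are assumptions; any installed packages work):

```python
import spacy

# Reuse the vocab of one loaded model when loading another
nlp_md = spacy.load("en_core_web_md")
nlp_sm = spacy.load("en_core_web_sm", vocab=nlp_md.vocab)
assert nlp_sm.vocab is nlp_md.vocab
```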
diff --git a/website/docs/api/goldparse.md b/website/docs/api/goldparse.md
index 5df625991..bc33dd4e6 100644
--- a/website/docs/api/goldparse.md
+++ b/website/docs/api/goldparse.md
@@ -12,18 +12,18 @@ expects true examples of a label to have the value `1.0`, and negative examples
of a label to have the value `0.0`. Labels not in the dictionary are treated as
missing – the gradient for those labels will be zero.
-| Name | Type | Description |
-| ----------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `doc` | `Doc` | The document the annotations refer to. |
-| `words` | iterable | A sequence of unicode word strings. |
-| `tags` | iterable | A sequence of strings, representing tag annotations. |
-| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
-| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
-| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
-| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
-| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
-| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False.`. |
-| **RETURNS** | `GoldParse` | The newly constructed object. |
+| Name | Type | Description |
+| ----------------- | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `doc` | `Doc` | The document the annotations refer to. |
+| `words` | iterable | A sequence of unicode word strings. |
+| `tags` | iterable | A sequence of strings, representing tag annotations. |
+| `heads` | iterable | A sequence of integers, representing syntactic head offsets. |
+| `deps` | iterable | A sequence of strings, representing the syntactic relation types. |
+| `entities` | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as `(start_char, end_char, label)` tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
+| `cats` | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is `1.0` (positive) or `0.0` (negative). |
+| `links` | dict | Labels for entity linking. A dict with `(start_char, end_char)` keys, and the values being dicts with `kb_id:value` entries, representing external KB IDs mapped to either `1.0` (positive) or `0.0` (negative). |
+| `make_projective` | bool | Whether to projectivize the dependency tree. Defaults to `False`. |
+| **RETURNS** | `GoldParse` | The newly constructed object. |
## GoldParse.\_\_len\_\_ {#len tag="method"}
@@ -43,17 +43,17 @@ Whether the provided syntactic annotations form a projective dependency tree.
## Attributes {#attributes}
-| Name | Type | Description |
-| ------------------------------------ | ---- | -------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `words` | list | The words. |
-| `tags` | list | The part-of-speech tag annotations. |
-| `heads` | list | The syntactic head annotations. |
-| `labels` | list | The syntactic relation-type annotations. |
-| `ner` | list | The named entity annotations as BILUO tags. |
-| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
-| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
-| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
-| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` triples, and the values are dictionaries with `kb_id:value` entries. |
+| Name | Type | Description |
+| ------------------------------------ | ---- | ------------------------------------------------------------------------------------------------------------------------ |
+| `words` | list | The words. |
+| `tags` | list | The part-of-speech tag annotations. |
+| `heads` | list | The syntactic head annotations. |
+| `labels` | list | The syntactic relation-type annotations. |
+| `ner` | list | The named entity annotations as BILUO tags. |
+| `cand_to_gold` | list | The alignment from candidate tokenization to gold tokenization. |
+| `gold_to_cand` | list | The alignment from gold tokenization to candidate tokenization. |
+| `cats` 2 | dict | Keys in the dictionary are string category labels with values `1.0` or `0.0`. |
+| `links` 2.2 | dict | Keys in the dictionary are `(start_char, end_char)` tuples, and the values are dictionaries with `kb_id:value` entries. |
## Utilities {#util}
@@ -61,7 +61,8 @@ Whether the provided syntactic annotations form a projective dependency tree.
Convert a list of Doc objects into the
[JSON-serializable format](/api/annotation#json-input) used by the
-[`spacy train`](/api/cli#train) command. Each input doc will be treated as a 'paragraph' in the output doc.
+[`spacy train`](/api/cli#train) command. Each input doc will be treated as a
+'paragraph' in the output doc.
> #### Example
>
diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index ac2f898e0..7b195e352 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -57,7 +57,7 @@ spaCy v2.3, the `Matcher` can also be called on `Span` objects.
| Name | Type | Description |
| ----------- | ------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3).. |
+| `doclike` | `Doc`/`Span` | The document to match over or a `Span` (as of v2.3). |
| **RETURNS** | list | A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern. |
diff --git a/website/docs/usage/101/_pos-deps.md b/website/docs/usage/101/_pos-deps.md
index 1a438e424..1e8960edf 100644
--- a/website/docs/usage/101/_pos-deps.md
+++ b/website/docs/usage/101/_pos-deps.md
@@ -36,7 +36,7 @@ for token in doc:
| Text | Lemma | POS | Tag | Dep | Shape | alpha | stop |
| ------- | ------- | ------- | ----- | ---------- | ------- | ------- | ------- |
| Apple | apple | `PROPN` | `NNP` | `nsubj` | `Xxxxx` | `True` | `False` |
-| is | be | `VERB` | `VBZ` | `aux` | `xx` | `True` | `True` |
+| is | be | `AUX` | `VBZ` | `aux` | `xx` | `True` | `True` |
| looking | look | `VERB` | `VBG` | `ROOT` | `xxxx` | `True` | `False` |
| at | at | `ADP` | `IN` | `prep` | `xx` | `True` | `True` |
| buying | buy | `VERB` | `VBG` | `pcomp` | `xxxx` | `True` | `False` |
diff --git a/website/docs/usage/adding-languages.md b/website/docs/usage/adding-languages.md
index d42aad705..29a9a1c27 100644
--- a/website/docs/usage/adding-languages.md
+++ b/website/docs/usage/adding-languages.md
@@ -662,7 +662,7 @@ One thing to keep in mind is that spaCy expects to train its models from **whole
documents**, not just single sentences. If your corpus only contains single
sentences, spaCy's models will never learn to expect multi-sentence documents,
leading to low performance on real text. To mitigate this problem, you can use
-the `-N` argument to the `spacy convert` command, to merge some of the sentences
+the `-n` argument to the `spacy convert` command, to merge some of the sentences
into longer pseudo-documents.
### Training the tagger and parser {#train-tagger-parser}
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 84bb3d71b..9031a356f 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -471,7 +471,7 @@ doc = nlp.make_doc("London is a big city in the United Kingdom.")
print("Before", doc.ents) # []
header = [ENT_IOB, ENT_TYPE]
-attr_array = numpy.zeros((len(doc), len(header)))
+attr_array = numpy.zeros((len(doc), len(header)), dtype="uint64")
attr_array[0, 0] = 3 # B
attr_array[0, 1] = doc.vocab.strings["GPE"]
doc.from_array(header, attr_array)
@@ -1143,9 +1143,9 @@ from spacy.gold import align
other_tokens = ["i", "listened", "to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
cost, a2b, b2a, a2b_multi, b2a_multi = align(other_tokens, spacy_tokens)
-print("Misaligned tokens:", cost) # 2
+print("Edit distance:", cost) # 3
print("One-to-one mappings a -> b", a2b) # array([0, 1, 2, 3, -1, -1, 5, 6])
-print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, 5, 6, 7])
+print("One-to-one mappings b -> a", b2a) # array([0, 1, 2, 3, -1, 6, 7])
print("Many-to-one mappings a -> b", a2b_multi) # {4: 4, 5: 4}
print("Many-to-one mappings b-> a", b2a_multi) # {}
```
@@ -1153,7 +1153,7 @@ print("Many-to-one mappings b-> a", b2a_multi) # {}
Here are some insights from the alignment information generated in the example
above:
-- Two tokens are misaligned.
+- The edit distance (cost) is `3`: two deletions and one insertion.
- The one-to-one mappings for the first four tokens are identical, which means
they map to each other. This makes sense because they're also identical in the
input: `"i"`, `"listened"`, `"to"` and `"obama"`.
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
index 382193157..b11e6347a 100644
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
@@ -117,6 +117,18 @@ The Chinese language class supports three word segmentation options:
better segmentation for Chinese OntoNotes and the new
[Chinese models](/models/zh).
+
+
+Note that [`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship
+with pre-compiled wheels for Python 3.8. If you're running Python 3.8, you can
+install it from our fork and compile it locally:
+
+```bash
+$ pip install https://github.com/honnibal/pkuseg-python/archive/master.zip
+```
+
+
+
The `meta` argument of the `Chinese` language class supports the following
@@ -196,12 +208,20 @@ nlp = Chinese(meta={"tokenizer": {"config": {"pkuseg_model": "/path/to/pkuseg_mo
The Japanese language class uses
[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. The default Japanese language class
-and the provided Japanese models use SudachiPy split mode `A`.
+segmentation and part-of-speech tagging. The default Japanese language class and
+the provided Japanese models use SudachiPy split mode `A`.
The `meta` argument of the `Japanese` language class can be used to configure
the split mode to `A`, `B` or `C`.
+
+
+If you run into errors related to `sudachipy`, which is currently under active
+development, we suggest downgrading to `sudachipy==0.4.5`, which is the version
+used for training the current [Japanese models](/models/ja).
+
+
+
## Installing and using models {#download}
> #### Downloading models in spaCy < v1.7
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 1db2405d1..f7866fe31 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -1158,17 +1158,17 @@ what you need for your application.
> available corpus.
For example, the corpus spaCy's [English models](/models/en) were trained on
-defines a `PERSON` entity as just the **person name**, without titles like "Mr"
-or "Dr". This makes sense, because it makes it easier to resolve the entity type
-back to a knowledge base. But what if your application needs the full names,
-_including_ the titles?
+defines a `PERSON` entity as just the **person name**, without titles like "Mr."
+or "Dr.". This makes sense, because it makes it easier to resolve the entity
+type back to a knowledge base. But what if your application needs the full
+names, _including_ the titles?
```python
### {executable="true"}
import spacy
nlp = spacy.load("en_core_web_sm")
-doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```
@@ -1233,7 +1233,7 @@ def expand_person_entities(doc):
# Add the component after the named entity recognizer
nlp.add_pipe(expand_person_entities, after='ner')
-doc = nlp("Dr Alex Smith chaired first board meeting of Acme Corp Inc.")
+doc = nlp("Dr. Alex Smith chaired first board meeting of Acme Corp Inc.")
print([(ent.text, ent.label_) for ent in doc.ents])
```
diff --git a/website/docs/usage/v2-3.md b/website/docs/usage/v2-3.md
index ba75b01ab..b6c4d7dfb 100644
--- a/website/docs/usage/v2-3.md
+++ b/website/docs/usage/v2-3.md
@@ -14,10 +14,10 @@ all language models, and decreased model size and loading times for models with
vectors. We've added pretrained models for **Chinese, Danish, Japanese, Polish
and Romanian** and updated the training data and vectors for most languages.
Model packages with vectors are about **2×** smaller on disk and load
-**2-4×** faster. For the full changelog, see the [release notes on
-GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0). For more
-details and a behind-the-scenes look at the new release, [see our blog
-post](https://explosion.ai/blog/spacy-v2-3).
+**2-4×** faster. For the full changelog, see the
+[release notes on GitHub](https://github.com/explosion/spaCy/releases/tag/v2.3.0).
+For more details and a behind-the-scenes look at the new release,
+[see our blog post](https://explosion.ai/blog/spacy-v2-3).
### Expanded model families with vectors {#models}
@@ -33,10 +33,10 @@ post](https://explosion.ai/blog/spacy-v2-3).
With new model families for Chinese, Danish, Japanese, Polish and Romanian plus
`md` and `lg` models with word vectors for all languages, this release provides
-a total of 46 model packages. For models trained using [Universal
-Dependencies](https://universaldependencies.org) corpora, the training data has
-been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish) and Dutch has been
-extended to include both UD Dutch Alpino and LassySmall.
+a total of 46 model packages. For models trained using
+[Universal Dependencies](https://universaldependencies.org) corpora, the
+training data has been updated to UD v2.5 (v2.6 for Japanese, v2.3 for Polish)
+and Dutch has been extended to include both UD Dutch Alpino and LassySmall.
@@ -48,6 +48,7 @@ extended to include both UD Dutch Alpino and LassySmall.
### Chinese {#chinese}
> #### Example
+>
> ```python
> from spacy.lang.zh import Chinese
>
@@ -57,41 +58,49 @@ extended to include both UD Dutch Alpino and LassySmall.
>
> # Append words to user dict
> nlp.tokenizer.pkuseg_update_user_dict(["中国", "ABC"])
+> ```
This release adds support for
-[pkuseg](https://github.com/lancopku/pkuseg-python) for word segmentation and
-the new Chinese models ship with a custom pkuseg model trained on OntoNotes.
-The Chinese tokenizer can be initialized with both `pkuseg` and custom models
-and the `pkuseg` user dictionary is easy to customize.
+[`pkuseg`](https://github.com/lancopku/pkuseg-python) for word segmentation and
+the new Chinese models ship with a custom pkuseg model trained on OntoNotes. The
+Chinese tokenizer can be initialized with both `pkuseg` and custom models and
+the `pkuseg` user dictionary is easy to customize. Note that
+[`pkuseg`](https://github.com/lancopku/pkuseg-python) doesn't yet ship with
+pre-compiled wheels for Python 3.8. See the
+[usage documentation](/usage/models#chinese) for details on how to install it on
+Python 3.8.
-**Chinese:** [Chinese tokenizer usage](/usage/models#chinese)
+**Models:** [Chinese models](/models/zh) **Usage:**
+[Chinese tokenizer usage](/usage/models#chinese)
### Japanese {#japanese}
The updated Japanese language class switches to
-[SudachiPy](https://github.com/WorksApplications/SudachiPy) for word
-segmentation and part-of-speech tagging. Using `sudachipy` greatly simplifies
+[`SudachiPy`](https://github.com/WorksApplications/SudachiPy) for word
+segmentation and part-of-speech tagging. Using `SudachiPy` greatly simplifies
installing spaCy for Japanese, which is now possible with a single command:
`pip install spacy[ja]`.
-**Japanese:** [Japanese tokenizer usage](/usage/models#japanese)
+**Models:** [Japanese models](/models/ja) **Usage:**
+[Japanese tokenizer usage](/usage/models#japanese)
### Small CLI updates
-- `spacy debug-data` provides the coverage of the vectors in a base model with
- `spacy debug-data lang train dev -b base_model`
-- `spacy evaluate` supports `blank:lg` (e.g. `spacy evaluate blank:en
- dev.json`) to evaluate the tokenization accuracy without loading a model
-- `spacy train` on GPU restricts the CPU timing evaluation to the first
- iteration
+- [`spacy debug-data`](/api/cli#debug-data) provides the coverage of the vectors
+ in a base model with `spacy debug-data lang train dev -b base_model`
+- [`spacy evaluate`](/api/cli#evaluate) supports `blank:lg` (e.g.
+ `spacy evaluate blank:en dev.json`) to evaluate the tokenization accuracy
+ without loading a model
+- [`spacy train`](/api/cli#train) on GPU restricts the CPU timing evaluation to
+ the first iteration
## Backwards incompatibilities {#incompat}
@@ -100,8 +109,8 @@ installing spaCy for Japanese, which is now possible with a single command:
If you've been training **your own models**, you'll need to **retrain** them
with the new version. Also don't forget to upgrade all models to the latest
versions. Models for earlier v2 releases (v2.0, v2.1, v2.2) aren't compatible
-with models for v2.3. To check if all of your models are up to date, you can
-run the [`spacy validate`](/api/cli#validate) command.
+with models for v2.3. To check if all of your models are up to date, you can run
+the [`spacy validate`](/api/cli#validate) command.
@@ -116,21 +125,20 @@ run the [`spacy validate`](/api/cli#validate) command.
> directly.
- If you're training new models, you'll want to install the package
- [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data),
- which now includes both the lemmatization tables (as in v2.2) and the
- normalization tables (new in v2.3). If you're using pretrained models,
- **nothing changes**, because the relevant tables are included in the model
- packages.
+ [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data), which
+ now includes both the lemmatization tables (as in v2.2) and the normalization
+ tables (new in v2.3). If you're using pretrained models, **nothing changes**,
+ because the relevant tables are included in the model packages.
- Due to the updated Universal Dependencies training data, the fine-grained
part-of-speech tags will change for many provided language models. The
coarse-grained part-of-speech tagset remains the same, but the mapping from
particular fine-grained to coarse-grained tags may show minor differences.
- For French, Italian, Portuguese and Spanish, the fine-grained part-of-speech
- tagsets contain new merged tags related to contracted forms, such as
- `ADP_DET` for French `"au"`, which maps to UPOS `ADP` based on the head
- `"à"`. This increases the accuracy of the models by improving the alignment
- between spaCy's tokenization and Universal Dependencies multi-word tokens
- used for contractions.
+ tagsets contain new merged tags related to contracted forms, such as `ADP_DET`
+ for French `"au"`, which maps to UPOS `ADP` based on the head `"à"`. This
+ increases the accuracy of the models by improving the alignment between
+ spaCy's tokenization and Universal Dependencies multi-word tokens used for
+ contractions.
### Migrating from spaCy 2.2 {#migrating}
@@ -143,29 +151,95 @@ v2.3 so that `token_match` has priority over prefixes and suffixes as in v2.2.1
and earlier versions.
A new tokenizer setting `url_match` has been introduced in v2.3.0 to handle
-cases like URLs where the tokenizer should remove prefixes and suffixes (e.g.,
-a comma at the end of a URL) before applying the match. See the full [tokenizer
-documentation](/usage/linguistic-features#tokenization) and try out
+cases like URLs where the tokenizer should remove prefixes and suffixes (e.g., a
+comma at the end of a URL) before applying the match. See the full
+[tokenizer documentation](/usage/linguistic-features#tokenization) and try out
[`nlp.tokenizer.explain()`](/usage/linguistic-features#tokenizer-debug) when
debugging your tokenizer configuration.
#### Warnings configuration
-spaCy's custom warnings have been replaced with native python
+spaCy's custom warnings have been replaced with native Python
[`warnings`](https://docs.python.org/3/library/warnings.html). Instead of
-setting `SPACY_WARNING_IGNORE`, use the [warnings
+setting `SPACY_WARNING_IGNORE`, use the [`warnings`
filters](https://docs.python.org/3/library/warnings.html#the-warnings-filter)
to manage warnings.
+```diff
+import spacy
++ import warnings
+
+- spacy.errors.SPACY_WARNING_IGNORE.append('W007')
++ warnings.filterwarnings("ignore", message=r"\[W007\]", category=UserWarning)
+```
+
#### Normalization tables
The normalization tables have moved from the language data in
-[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to
-the package
-[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). If
-you're adding data for a new language, the normalization table should be added
-to `spacy-lookups-data`. See [adding norm
-exceptions](/usage/adding-languages#norm-exceptions).
+[`spacy/lang`](https://github.com/explosion/spaCy/tree/master/spacy/lang) to the
+package [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data).
+If you're adding data for a new language, the normalization table should be
+added to `spacy-lookups-data`. See
+[adding norm exceptions](/usage/adding-languages#norm-exceptions).
+
+#### No preloaded vocab for models with vectors
+
+To reduce the initial loading time, the lexemes in `nlp.vocab` are no longer
+loaded on initialization for models with vectors. As you process texts, the
+lexemes will be added to the vocab automatically, just as in small models
+without vectors.
+
+To see the number of unique vectors and the number of words with vectors, check
+`nlp.meta['vectors']`. For example, `en_core_web_md` has `20000` unique vectors
+and `684830` words with vectors:
+
+```python
+{
+ 'width': 300,
+ 'vectors': 20000,
+ 'keys': 684830,
+ 'name': 'en_core_web_md.vectors'
+}
+```
+
+If required, for instance if you are working directly with word vectors rather
+than processing texts, you can load all lexemes for words with vectors at once:
+
+```python
+for orth in nlp.vocab.vectors:
+ _ = nlp.vocab[orth]
+```
+
+If your workflow previously iterated over `nlp.vocab`, a similar alternative
+is to iterate over words with vectors instead:
+
+```diff
+- lexemes = [w for w in nlp.vocab]
++ lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+```
+
+Be aware that the set of preloaded lexemes in a v2.2 model is not equivalent to
+the set of words with vectors. For English, v2.2 `md/lg` models have 1.3M
+provided lexemes but only 685K words with vectors. The vectors have been
+updated for most languages in v2.2, but the English models contain the same
+vectors for both v2.2 and v2.3.
+
+#### Lexeme.is_oov and Token.is_oov
+
+
+
+Due to a bug, the values for `is_oov` are reversed in v2.3.0, but this will be
+fixed in the next patch release v2.3.1.
+
+
+
+In v2.3, `Lexeme.is_oov` and `Token.is_oov` are `True` if the lexeme does not
+have a word vector. This is equivalent to `token.orth not in
+nlp.vocab.vectors`.
+
+Previously in v2.2, `is_oov` corresponded to whether a lexeme had stored
+probability and cluster features. The probability and cluster features are no
+longer included in the provided medium and large models (see the next section).
#### Probability and cluster features
@@ -181,28 +255,50 @@ exceptions](/usage/adding-languages#norm-exceptions).
The `Token.prob` and `Token.cluster` features, which are no longer used by the
core pipeline components as of spaCy v2, are no longer provided in the
-pretrained models to reduce the model size. To keep these features available
-for users relying on them, the `prob` and `cluster` features for the most
-frequent 1M tokens have been moved to
+pretrained models to reduce the model size. To keep these features available for
+users relying on them, the `prob` and `cluster` features for the most frequent
+1M tokens have been moved to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) as
`extra` features for the relevant languages (English, German, Greek and
Spanish).
The extra tables are loaded lazily, so if you have `spacy-lookups-data`
-installed and your code accesses `Token.prob`, the full table is loaded into
-the model vocab, which will take a few seconds on initial loading. When you
-save this model after loading the `prob` table, the full `prob` table will be
-saved as part of the model vocab.
+installed and your code accesses `Token.prob`, the full table is loaded into the
+model vocab, which will take a few seconds on initial loading. When you save
+this model after loading the `prob` table, the full `prob` table will be saved
+as part of the model vocab.
-If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as
-part of a new model, add the data to
+To load the probability table into a provided model, first make sure you have
+`spacy-lookups-data` installed. Then remove the empty placeholder `lexeme_prob`
+table and access `Lexeme.prob` for any word, which loads the full table from
+`spacy-lookups-data`:
+
+```diff
++ # prerequisite: pip install spacy-lookups-data
+import spacy
+
+nlp = spacy.load("en_core_web_md")
+
+# remove the empty placeholder prob table
++ if nlp.vocab.lookups_extra.has_table("lexeme_prob"):
++ nlp.vocab.lookups_extra.remove_table("lexeme_prob")
+
+# access any `.prob` to load the full table into the model
+assert nlp.vocab["a"].prob == -3.9297883511
+
+# if desired, save this model with the probability table included
+nlp.to_disk("/path/to/model")
+```
+
+If you'd like to include custom `cluster`, `prob`, or `sentiment` tables as part
+of a new model, add the data to
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data) under
the entry point `lg_extra`, e.g. `en_extra` for English. Alternatively, you can
initialize your [`Vocab`](/api/vocab) with the `lookups_extra` argument with a
[`Lookups`](/api/lookups) object that includes the tables `lexeme_cluster`,
`lexeme_prob`, `lexeme_sentiment` or `lexeme_settings`. `lexeme_settings` is
-currently only used to provide a custom `oov_prob`. See examples in the [`data`
-directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
+currently only used to provide a custom `oov_prob`. See examples in the
+[`data` directory](https://github.com/explosion/spacy-lookups-data/tree/master/spacy_lookups_data/data)
in `spacy-lookups-data`.
#### Initializing new models without extra lookups tables
@@ -211,3 +307,39 @@ When you initialize a new model with [`spacy init-model`](/api/cli#init-model),
the `prob` table from `spacy-lookups-data` may be loaded as part of the
initialization. If you'd like to omit this extra data as in spaCy's provided
v2.3 models, use the new flag `--omit-extra-lookups`.
+
+#### Tag maps in provided models vs. blank models
+
+The tag maps in the provided models may differ from the tag maps in the spaCy
+library. You can access the tag map in a loaded model under
+`nlp.vocab.morphology.tag_map`.
+
+The tag map from `spacy.lang.lg.tag_map` is still used when a blank model is
+initialized. If you want to provide an alternate tag map, update
+`nlp.vocab.morphology.tag_map` after initializing the model, or, if you're using
+the [train CLI](/api/cli#train), use the new `--tag-map-path` option to provide
+the tag map as a JSON dict.
+
+If you want to export a tag map from a provided model for use with the train
+CLI, you can save it as a JSON dict. To only use string keys as required by
+JSON and to make it easier to read and edit, any internal integer IDs need to
+be converted back to strings:
+
+```python
+import spacy
+import srsly
+
+nlp = spacy.load("en_core_web_sm")
+tag_map = {}
+
+# convert any integer IDs to strings for JSON
+for tag, morph in nlp.vocab.morphology.tag_map.items():
+ tag_map[tag] = {}
+ for feat, val in morph.items():
+ feat = nlp.vocab.strings.as_string(feat)
+ if not isinstance(val, bool):
+ val = nlp.vocab.strings.as_string(val)
+ tag_map[tag][feat] = val
+
+srsly.write_json("tag_map.json", tag_map)
+```
diff --git a/website/meta/site.json b/website/meta/site.json
index 29d71048e..8b8424f82 100644
--- a/website/meta/site.json
+++ b/website/meta/site.json
@@ -23,9 +23,9 @@
"apiKey": "371e26ed49d29a27bd36273dfdaf89af",
"indexName": "spacy"
},
- "binderUrl": "ines/spacy-io-binder",
+ "binderUrl": "explosion/spacy-io-binder",
"binderBranch": "live",
- "binderVersion": "2.2.0",
+ "binderVersion": "2.3.0",
"sections": [
{ "id": "usage", "title": "Usage Documentation", "theme": "blue" },
{ "id": "models", "title": "Models Documentation", "theme": "blue" },