diff --git a/.github/ISSUE_TEMPLATE/01_bugs.md b/.github/ISSUE_TEMPLATE/01_bugs.md
index 768832c24..255a5241e 100644
--- a/.github/ISSUE_TEMPLATE/01_bugs.md
+++ b/.github/ISSUE_TEMPLATE/01_bugs.md
@@ -4,6 +4,8 @@ about: Use this template if you came across a bug or unexpected behaviour differ
---
+
+
## How to reproduce the behaviour
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
index fce1a1064..31f89f917 100644
--- a/.github/ISSUE_TEMPLATE/config.yml
+++ b/.github/ISSUE_TEMPLATE/config.yml
@@ -1,8 +1,5 @@
blank_issues_enabled: false
contact_links:
- - name: ⚠️ Python 3.10 Support
- url: https://github.com/explosion/spaCy/discussions/9418
- about: Python 3.10 wheels haven't been released yet, see the link for details.
- name: 🗯 Discussions Forum
url: https://github.com/explosion/spaCy/discussions
about: Install issues, usage questions, general discussion and anything else that isn't a bug report.
diff --git a/.github/contributors/fonfonx.md b/.github/contributors/fonfonx.md
new file mode 100644
index 000000000..7fb01ca5a
--- /dev/null
+++ b/.github/contributors/fonfonx.md
@@ -0,0 +1,106 @@
+# spaCy contributor agreement
+
+This spaCy Contributor Agreement (**"SCA"**) is based on the
+[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
+The SCA applies to any contribution that you make to any product or project
+managed by us (the **"project"**), and sets out the intellectual property rights
+you grant to us in the contributed materials. The term **"us"** shall mean
+[ExplosionAI GmbH](https://explosion.ai/legal). The term
+**"you"** shall mean the person or entity identified below.
+
+If you agree to be bound by these terms, fill in the information requested
+below and include the filled-in version with your first pull request, under the
+folder [`.github/contributors/`](/.github/contributors/). The name of the file
+should be your GitHub username, with the extension `.md`. For example, the user
+example_user would create the file `.github/contributors/example_user.md`.
+
+Read this agreement carefully before signing. These terms and conditions
+constitute a binding legal agreement.
+
+## Contributor Agreement
+
+1. The term "contribution" or "contributed materials" means any source code,
+object code, patch, tool, sample, graphic, specification, manual,
+documentation, or any other material posted or submitted by you to the project.
+
+2. With respect to any worldwide copyrights, or copyright applications and
+registrations, in your contribution:
+
+ * you hereby assign to us joint ownership, and to the extent that such
+ assignment is or becomes invalid, ineffective or unenforceable, you hereby
+ grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
+ royalty-free, unrestricted license to exercise all rights under those
+ copyrights. This includes, at our option, the right to sublicense these same
+ rights to third parties through multiple levels of sublicensees or other
+ licensing arrangements;
+
+ * you agree that each of us can do all things in relation to your
+ contribution as if each of us were the sole owners, and if one of us makes
+ a derivative work of your contribution, the one who makes the derivative
+ work (or has it made) will be the sole owner of that derivative work;
+
+ * you agree that you will not assert any moral rights in your contribution
+ against us, our licensees or transferees;
+
+ * you agree that we may register a copyright in your contribution and
+ exercise all ownership rights associated with it; and
+
+ * you agree that neither of us has any duty to consult with, obtain the
+ consent of, pay or render an accounting to the other for any use or
+ distribution of your contribution.
+
+3. With respect to any patents you own, or that you can license without payment
+to any third party, you hereby grant to us a perpetual, irrevocable,
+non-exclusive, worldwide, no-charge, royalty-free license to:
+
+ * make, have made, use, sell, offer to sell, import, and otherwise transfer
+ your contribution in whole or in part, alone or in combination with or
+ included in any product, work or materials arising out of the project to
+ which your contribution was submitted, and
+
+ * at our option, to sublicense these same rights to third parties through
+ multiple levels of sublicensees or other licensing arrangements.
+
+4. Except as set out above, you keep all right, title, and interest in your
+contribution. The rights that you grant to us under these terms are effective
+on the date you first submitted a contribution to us, even if your submission
+took place before the date you sign these terms.
+
+5. You covenant, represent, warrant and agree that:
+
+ * Each contribution that you submit is and shall be an original work of
+ authorship and you can legally grant the rights set out in this SCA;
+
+ * to the best of your knowledge, each contribution will not violate any
+ third party's copyrights, trademarks, patents, or other intellectual
+ property rights; and
+
+ * each contribution shall be in compliance with U.S. export control laws and
+ other applicable export and import laws. You agree to notify us if you
+ become aware of any circumstance which would make any of the foregoing
+ representations inaccurate in any respect. We may publicly disclose your
+ participation in the project, including the fact that you have signed the SCA.
+
+6. This SCA is governed by the laws of the State of California and applicable
+U.S. Federal law. Any choice of law rules will not apply.
+
+7. Please place an “x” on one of the applicable statements below. Please do NOT
+mark both statements:
+
+ * [x] I am signing on behalf of myself as an individual and no other person
+ or entity, including my employer, has or will have rights with respect to my
+ contributions.
+
+ * [ ] I am signing on behalf of my employer or a legal entity and I have the
+ actual authority to contractually bind that entity.
+
+## Contributor Details
+
+| Field | Entry |
+|------------------------------- | -------------------- |
+| Name | Xavier Fontaine |
+| Company name (if applicable) | |
+| Title or role (if applicable) | |
+| Date | 2022-04-13 |
+| GitHub username | fonfonx |
+| Website (optional) | |
diff --git a/.github/workflows/gputests.yml b/.github/workflows/gputests.yml
new file mode 100644
index 000000000..bb7f51d29
--- /dev/null
+++ b/.github/workflows/gputests.yml
@@ -0,0 +1,21 @@
+name: Weekly GPU tests
+
+on:
+ schedule:
+ - cron: '0 1 * * MON'
+
+jobs:
+ weekly-gputests:
+ strategy:
+ fail-fast: false
+ matrix:
+ branch: [master, v4]
+ runs-on: ubuntu-latest
+ steps:
+ - name: Trigger buildkite build
+ uses: buildkite/trigger-pipeline-action@v1.2.0
+ env:
+ PIPELINE: explosion-ai/spacy-slow-gpu-tests
+ BRANCH: ${{ matrix.branch }}
+ MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
+ BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
diff --git a/.github/workflows/slowtests.yml b/.github/workflows/slowtests.yml
new file mode 100644
index 000000000..1a99c751c
--- /dev/null
+++ b/.github/workflows/slowtests.yml
@@ -0,0 +1,37 @@
+name: Daily slow tests
+
+on:
+ schedule:
+ - cron: '0 0 * * *'
+
+jobs:
+ daily-slowtests:
+ strategy:
+ fail-fast: false
+ matrix:
+ branch: [master, v4]
+ runs-on: ubuntu-latest
+ steps:
+ - name: Checkout
+ uses: actions/checkout@v1
+ with:
+ ref: ${{ matrix.branch }}
+ - name: Get commits from past 24 hours
+ id: check_commits
+ run: |
+ today=$(date '+%Y-%m-%d %H:%M:%S')
+ yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
+ if git log --after="$yesterday" --before="$today" | grep commit ; then
+ echo "::set-output name=run_tests::true"
+ else
+ echo "::set-output name=run_tests::false"
+ fi
+
+ - name: Trigger buildkite build
+ if: steps.check_commits.outputs.run_tests == 'true'
+ uses: buildkite/trigger-pipeline-action@v1.2.0
+ env:
+ PIPELINE: explosion-ai/spacy-slow-tests
+ BRANCH: ${{ matrix.branch }}
+ MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
+ BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
diff --git a/.gitignore b/.gitignore
index 60036a475..ac72f2bbf 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,7 +9,6 @@ keys/
spacy/tests/package/setup.cfg
spacy/tests/package/pyproject.toml
spacy/tests/package/requirements.txt
-spacy/tests/universe/universe.json
# Website
website/.cache/
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index a7a12fd24..b959262e3 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -1,9 +1,10 @@
repos:
- repo: https://github.com/ambv/black
- rev: 21.6b0
+ rev: 22.3.0
hooks:
- id: black
language_version: python3.7
+ additional_dependencies: ['click==8.0.4']
- repo: https://gitlab.com/pycqa/flake8
rev: 3.9.2
hooks:
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 9a7d0744a..ddd833be1 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -144,7 +144,7 @@ Changes to `.py` files will be effective immediately.
When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already
-exist. The description text can be very short โ we don't want to make this too
+exist. The description text can be very short โ we don't want to make this too
bureaucratic.
Next, add a test to the relevant file in the
@@ -233,7 +233,7 @@ also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.
-The [`.flake8`](.flake8) config defines the configuration we use for this
+The `flake8` section in [`setup.cfg`](setup.cfg) defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.
diff --git a/README.md b/README.md
index 57d76fb45..05c912ffa 100644
--- a/README.md
+++ b/README.md
@@ -32,19 +32,20 @@ open-source software, released under the MIT license.
## 📖 Documentation
-| Documentation | |
-| -------------------------- | -------------------------------------------------------------- |
-| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
-| 📚 **[Usage Guides]** | How to use spaCy and its features. |
-| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
-| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
-| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
-| 📦 **[Models]** | Download trained pipelines for spaCy. |
-| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
-| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
-| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
-| 🛠 **[Changelog]** | Changes and version history. |
-| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
+| Documentation | |
+| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
+| 📚 **[Usage Guides]** | How to use spaCy and its features. |
+| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
+| 🪐 **[Project Templates]** | End-to-end workflows you can clone, modify and run. |
+| 🎛 **[API Reference]** | The detailed reference for spaCy's API. |
+| 📦 **[Models]** | Download trained pipelines for spaCy. |
+| 🌌 **[Universe]** | Plugins, extensions, demos and books from the spaCy ecosystem. |
+| 👩‍🏫 **[Online Course]** | Learn spaCy in this free and interactive online course. |
+| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
+| 🛠 **[Changelog]** | Changes and version history. |
+| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
+| | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
[spacy 101]: https://spacy.io/usage/spacy-101
[new in v3.0]: https://spacy.io/usage/v3
@@ -60,9 +61,7 @@ open-source software, released under the MIT license.
## 💬 Where to ask questions
-The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**,
-**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
-**[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
+The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that
more people can benefit from it.
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
index 71a793911..4624b2eb2 100644
--- a/azure-pipelines.yml
+++ b/azure-pipelines.yml
@@ -11,12 +11,14 @@ trigger:
exclude:
- "website/*"
- "*.md"
+ - ".github/workflows/*"
pr:
- paths:
+ paths:
exclude:
- "*.md"
- "website/docs/*"
- "website/src/*"
+ - ".github/workflows/*"
jobs:
# Perform basic checks for most important errors (syntax etc.) Uses the config
diff --git a/extra/DEVELOPER_DOCS/Code Conventions.md b/extra/DEVELOPER_DOCS/Code Conventions.md
index eba466c46..37cd8ff27 100644
--- a/extra/DEVELOPER_DOCS/Code Conventions.md
+++ b/extra/DEVELOPER_DOCS/Code Conventions.md
@@ -137,7 +137,7 @@ If any of the TODOs you've added are important and should be fixed soon, you sho
## Type hints
-We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation.
+We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation. Ideally when developing, run `mypy spacy` on the code base to inspect any issues.
If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable` – although you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return values.
@@ -155,6 +155,13 @@ def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
return callback
```
+For typing variables, we prefer the explicit format.
+
+```diff
+- var = value # type: Type
++ var: Type = value
+```
+
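+A minimal sketch that ties these conventions together (the function and names below are purely illustrative and not taken from the code base):
+
+```python
+from typing import Callable, List
+
+
+def make_tokenizer(lowercase: bool) -> Callable[[str], List[str]]:
+    """Return a callback that splits a text on whitespace."""
+
+    def tokenize(text: str) -> List[str]:
+        # Explicit variable annotation, as preferred above
+        tokens: List[str] = text.split()
+        return [token.lower() for token in tokens] if lowercase else tokens
+
+    return tokenize
+```
+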
For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).
```python
diff --git a/extra/DEVELOPER_DOCS/ExplosionBot.md b/extra/DEVELOPER_DOCS/ExplosionBot.md
new file mode 100644
index 000000000..eebec1a06
--- /dev/null
+++ b/extra/DEVELOPER_DOCS/ExplosionBot.md
@@ -0,0 +1,36 @@
+# Explosion-bot
+
+Explosion-bot is a robot that can be invoked to help with running particular test commands.
+
+## Permissions
+
+Only maintainers have permission to summon explosion-bot. Each of the open-source repos that use explosion-bot has its own team(s) of maintainers, and only GitHub users who are members of those teams can successfully run bot commands.
+
+## Running robot commands
+
+To summon the robot, write a GitHub comment on the issue/PR you wish to test. The comment must be in the following format:
+
+```
+@explosion-bot please test_gpu
+```
+
+Some things to note:
+
+* The `@explosion-bot please` must be the beginning of the command - you cannot add anything in front of this or else the robot won't know how to parse it. Adding anything at the end aside from the test name will also confuse the robot, so keep it simple!
+* The command name (such as `test_gpu`) must be one of the tests that the bot knows how to run. The available commands are documented in the bot's [workflow config](https://github.com/explosion/spaCy/blob/master/.github/workflows/explosionbot.yml#L26) and must match exactly one of the commands listed there.
+* The robot can't do multiple things at once, so if you want it to run multiple tests, you'll have to summon it with one comment per test.
+* For the `test_gpu` command, you can specify an optional thinc branch (from the spaCy repo) or a spaCy branch (from the thinc repo) with either the `--thinc-branch` or `--spacy-branch` flags. By default, the bot will pull in the PR branch from the repo where the command was issued, and the main branch of the other repository. However, if you need to run against another branch, you can say (for example):
+
+```
+@explosion-bot please test_gpu --thinc-branch develop
+```
+You can also specify a branch from an unmerged PR:
+```
+@explosion-bot please test_gpu --thinc-branch refs/pull/633/head
+```
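+
+The same flags work in the other direction: when the bot is summoned from the thinc repo, a spaCy branch can be selected with `--spacy-branch`. A hypothetical example (the branch name is only for illustration):
+
+```
+@explosion-bot please test_gpu --spacy-branch v4
+```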
+
+## Troubleshooting
+
+If the robot isn't responding to commands as expected, you can check its logs in the [GitHub Action](https://github.com/explosion/spaCy/actions/workflows/explosionbot.yml).
+
+For each command sent to the bot, there should be a run of the `explosion-bot` workflow. In the `Install and run explosion-bot` step, towards the end of the logs you should see info about the configuration that the bot was run with, as well as any errors that the bot encountered.
diff --git a/pyproject.toml b/pyproject.toml
index f81484d43..a43b4c814 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -5,7 +5,7 @@ requires = [
"cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0",
- "thinc>=8.0.12,<8.1.0",
+ "thinc>=8.0.14,<8.1.0",
"blis>=0.4.0,<0.8.0",
"pathy",
"numpy>=1.15.0",
diff --git a/requirements.txt b/requirements.txt
index 8d7372cfe..619d35ebc 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,14 +1,14 @@
# Our libraries
-spacy-legacy>=3.0.8,<3.1.0
+spacy-legacy>=3.0.9,<3.1.0
spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
-thinc>=8.0.12,<8.1.0
+thinc>=8.0.14,<8.1.0
blis>=0.4.0,<0.8.0
ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0
-wasabi>=0.8.1,<1.1.0
-srsly>=2.4.1,<3.0.0
+wasabi>=0.9.1,<1.1.0
+srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0
pathy>=0.3.5
@@ -26,7 +26,7 @@ typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
# Development dependencies
pre-commit>=2.13.0
cython>=0.25,<3.0
-pytest>=5.2.0
+pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0
flake8>=3.8.0,<3.10.0
@@ -35,3 +35,4 @@ mypy==0.910
types-dataclasses>=0.1.3; python_version < "3.7"
types-mock>=0.1.1
types-requests
+black>=22.0,<23.0
diff --git a/setup.cfg b/setup.cfg
index 586a044ff..2626de87e 100644
--- a/setup.cfg
+++ b/setup.cfg
@@ -38,18 +38,18 @@ setup_requires =
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0
- thinc>=8.0.12,<8.1.0
+ thinc>=8.0.14,<8.1.0
install_requires =
# Our libraries
- spacy-legacy>=3.0.8,<3.1.0
+ spacy-legacy>=3.0.9,<3.1.0
spacy-loggers>=1.0.0,<2.0.0
murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0
- thinc>=8.0.12,<8.1.0
+ thinc>=8.0.14,<8.1.0
blis>=0.4.0,<0.8.0
- wasabi>=0.8.1,<1.1.0
- srsly>=2.4.1,<3.0.0
+ wasabi>=0.9.1,<1.1.0
+ srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0
pathy>=0.3.5
diff --git a/setup.py b/setup.py
index 03a1e01dd..9023b9fa3 100755
--- a/setup.py
+++ b/setup.py
@@ -23,6 +23,7 @@ Options.docstrings = True
PACKAGES = find_packages()
MOD_NAMES = [
+ "spacy.training.alignment_array",
"spacy.training.example",
"spacy.parts_of_speech",
"spacy.strings",
@@ -33,6 +34,7 @@ MOD_NAMES = [
"spacy.ml.parser_model",
"spacy.morphology",
"spacy.pipeline.dep_parser",
+ "spacy.pipeline._edit_tree_internals.edit_trees",
"spacy.pipeline.morphologizer",
"spacy.pipeline.multitask",
"spacy.pipeline.ner",
@@ -81,7 +83,6 @@ COPY_FILES = {
ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
- ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
}
diff --git a/spacy/about.py b/spacy/about.py
index c253d5052..03eabc2e9 100644
--- a/spacy/about.py
+++ b/spacy/about.py
@@ -1,6 +1,6 @@
# fmt: off
__title__ = "spacy"
-__version__ = "3.2.1"
+__version__ = "3.3.0"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects"
diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py
index fd8da262e..ce76ef9a9 100644
--- a/spacy/cli/__init__.py
+++ b/spacy/cli/__init__.py
@@ -14,6 +14,7 @@ from .pretrain import pretrain # noqa: F401
from .debug_data import debug_data # noqa: F401
from .debug_config import debug_config # noqa: F401
from .debug_model import debug_model # noqa: F401
+from .debug_diff import debug_diff # noqa: F401
from .evaluate import evaluate # noqa: F401
from .convert import convert # noqa: F401
from .init_pipeline import init_pipeline_cli # noqa: F401
diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py
index fb680d888..df98e711f 100644
--- a/spacy/cli/_util.py
+++ b/spacy/cli/_util.py
@@ -360,7 +360,7 @@ def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False)
src = str(src)
with smart_open.open(src, mode="rb", ignore_ext=True) as input_file:
with dest.open(mode="wb") as output_file:
- output_file.write(input_file.read())
+ shutil.copyfileobj(input_file, output_file)
def ensure_pathy(path):
diff --git a/spacy/cli/debug_data.py b/spacy/cli/debug_data.py
index ab7c20d48..f94319d1d 100644
--- a/spacy/cli/debug_data.py
+++ b/spacy/cli/debug_data.py
@@ -19,6 +19,7 @@ from ..morphology import Morphology
from ..language import Language
from ..util import registry, resolve_dot_names
from ..compat import Literal
+from ..vectors import Mode as VectorsMode
from .. import util
@@ -170,29 +171,101 @@ def debug_data(
show=verbose,
)
if len(nlp.vocab.vectors):
- msg.info(
- f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
- f"unique keys, {nlp.vocab.vectors_length} dimensions)"
- )
- n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
- msg.warn(
- "{} words in training data without vectors ({:.0f}%)".format(
- n_missing_vectors,
- 100 * (n_missing_vectors / gold_train_data["n_words"]),
- ),
- )
- msg.text(
- "10 most common words without vectors: {}".format(
- _format_labels(
- gold_train_data["words_missing_vectors"].most_common(10),
- counts=True,
- )
- ),
- show=verbose,
- )
+ if nlp.vocab.vectors.mode == VectorsMode.floret:
+ msg.info(
+ f"floret vectors with {len(nlp.vocab.vectors)} vectors, "
+ f"{nlp.vocab.vectors_length} dimensions, "
+ f"{nlp.vocab.vectors.minn}-{nlp.vocab.vectors.maxn} char "
+ f"n-gram subwords"
+ )
+ else:
+ msg.info(
+ f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
+ f"unique keys, {nlp.vocab.vectors_length} dimensions)"
+ )
+ n_missing_vectors = sum(gold_train_data["words_missing_vectors"].values())
+ msg.warn(
+ "{} words in training data without vectors ({:.0f}%)".format(
+ n_missing_vectors,
+ 100 * (n_missing_vectors / gold_train_data["n_words"]),
+ ),
+ )
+ msg.text(
+ "10 most common words without vectors: {}".format(
+ _format_labels(
+ gold_train_data["words_missing_vectors"].most_common(10),
+ counts=True,
+ )
+ ),
+ show=verbose,
+ )
else:
msg.info("No word vectors present in the package")
+ if "spancat" in factory_names:
+ model_labels_spancat = _get_labels_from_spancat(nlp)
+ has_low_data_warning = False
+ has_no_neg_warning = False
+
+ msg.divider("Span Categorization")
+ msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
+
+ msg.text("Label counts in train data: ", show=verbose)
+ for spans_key, data_labels in gold_train_data["spancat"].items():
+ msg.text(
+ f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
+ show=verbose,
+ )
+ # Data checks: only take the spans keys in the actual spancat components
+ data_labels_in_component = {
+ spans_key: gold_train_data["spancat"][spans_key]
+ for spans_key in model_labels_spancat.keys()
+ }
+ for spans_key, data_labels in data_labels_in_component.items():
+ for label, count in data_labels.items():
+ # Check for missing labels
+ spans_key_in_model = spans_key in model_labels_spancat.keys()
+ if (spans_key_in_model) and (
+ label not in model_labels_spancat[spans_key]
+ ):
+ msg.warn(
+ f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
+ "Performance may degrade after training."
+ )
+ # Check for low number of examples per label
+ if count <= NEW_LABEL_THRESHOLD:
+ msg.warn(
+ f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
+ )
+ has_low_data_warning = True
+ # Check for negative examples
+ with msg.loading("Analyzing label distribution..."):
+ neg_docs = _get_examples_without_label(
+ train_dataset, label, "spancat", spans_key
+ )
+ if neg_docs == 0:
+ msg.warn(f"No examples for texts WITHOUT new label '{label}'")
+ has_no_neg_warning = True
+
+ if has_low_data_warning:
+ msg.text(
+ f"To train a new span type, your data should include at "
+ f"least {NEW_LABEL_THRESHOLD} instances of the new label",
+ show=verbose,
+ )
+ else:
+ msg.good("Good amount of examples for all labels")
+
+ if has_no_neg_warning:
+ msg.text(
+ "Training data should always include examples of spans "
+ "in context, as well as examples without a given span "
+ "type.",
+ show=verbose,
+ )
+ else:
+ msg.good("Examples without ocurrences available for all labels")
+
if "ner" in factory_names:
# Get all unique NER labels present in the data
labels = set(
@@ -238,7 +311,7 @@ def debug_data(
has_low_data_warning = True
with msg.loading("Analyzing label distribution..."):
- neg_docs = _get_examples_without_label(train_dataset, label)
+ neg_docs = _get_examples_without_label(train_dataset, label, "ner")
if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True
@@ -573,6 +646,7 @@ def _compile_gold(
"deps": Counter(),
"words": Counter(),
"roots": Counter(),
+ "spancat": dict(),
"ws_ents": 0,
"boundary_cross_ents": 0,
"n_words": 0,
@@ -603,6 +677,7 @@ def _compile_gold(
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word])
if "ner" in factory_names:
+ sent_starts = eg.get_aligned_sent_starts()
for i, label in enumerate(eg.get_aligned_ner()):
if label is None:
continue
@@ -612,10 +687,19 @@ def _compile_gold(
if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1]
data["ner"][combined_label] += 1
- if gold[i].is_sent_start and label.startswith(("I-", "L-")):
+ if sent_starts[i] == True and label.startswith(("I-", "L-")):
data["boundary_cross_ents"] += 1
elif label == "-":
data["ner"]["-"] += 1
+ if "spancat" in factory_names:
+ for span_key in list(eg.reference.spans.keys()):
+ if span_key not in data["spancat"]:
+ data["spancat"][span_key] = Counter()
+ for i, span in enumerate(eg.reference.spans[span_key]):
+ if span.label_ is None:
+ continue
+ else:
+ data["spancat"][span_key][span.label_] += 1
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
data["cats"].update(gold.cats)
if any(val not in (0, 1) for val in gold.cats.values()):
@@ -686,14 +770,28 @@ def _format_labels(
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
-def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
+def _get_examples_without_label(
+ data: Sequence[Example],
+ label: str,
+ component: Literal["ner", "spancat"] = "ner",
+ spans_key: Optional[str] = "sc",
+) -> int:
count = 0
for eg in data:
- labels = [
- label.split("-")[1]
- for label in eg.get_aligned_ner()
- if label not in ("O", "-", None)
- ]
+ if component == "ner":
+ labels = [
+ label.split("-")[1]
+ for label in eg.get_aligned_ner()
+ if label not in ("O", "-", None)
+ ]
+
+ if component == "spancat":
+ labels = (
+ [span.label_ for span in eg.reference.spans[spans_key]]
+ if spans_key in eg.reference.spans
+ else []
+ )
+
if label not in labels:
count += 1
return count
diff --git a/spacy/cli/debug_diff.py b/spacy/cli/debug_diff.py
new file mode 100644
index 000000000..6697c38ae
--- /dev/null
+++ b/spacy/cli/debug_diff.py
@@ -0,0 +1,89 @@
+from typing import Optional
+
+import typer
+from wasabi import Printer, diff_strings, MarkdownRenderer
+from pathlib import Path
+from thinc.api import Config
+
+from ._util import debug_cli, Arg, Opt, show_validation_error, parse_config_overrides
+from ..util import load_config
+from .init_config import init_config, Optimizations
+
+
+@debug_cli.command(
+ "diff-config",
+ context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
+)
+def debug_diff_cli(
+ # fmt: off
+ ctx: typer.Context,
+ config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
+ compare_to: Optional[Path] = Opt(None, help="Path to a config file to diff against, or `None` to compare against default settings", exists=True, allow_dash=True),
+ optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether the user config was optimized for efficiency or accuracy. Only relevant when comparing against the default config."),
+ gpu: bool = Opt(False, "--gpu", "-G", help="Whether the original config can run on a GPU. Only relevant when comparing against the default config."),
+ pretraining: bool = Opt(False, "--pretraining", "--pt", help="Whether to compare on a config with pretraining involved. Only relevant when comparing against the default config."),
+ markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues")
+ # fmt: on
+):
+ """Show a diff of a config file with respect to spaCy's defaults or another config file. If
+ additional settings were used in the creation of the config file, then you
+ must supply these as extra parameters to the command when comparing to the default settings. The generated diff
+ can also be used when posting to the discussion forum to provide more
+ information for the maintainers.
+
+ The `optimize`, `gpu`, and `pretraining` options are only relevant when
+ comparing against the default configuration (or specifically when `compare_to` is None).
+
+ DOCS: https://spacy.io/api/cli#debug-diff
+ """
+ debug_diff(
+ config_path=config_path,
+ compare_to=compare_to,
+ gpu=gpu,
+ optimize=optimize,
+ pretraining=pretraining,
+ markdown=markdown,
+ )
+
+
+def debug_diff(
+ config_path: Path,
+ compare_to: Optional[Path],
+ gpu: bool,
+ optimize: Optimizations,
+ pretraining: bool,
+ markdown: bool,
+):
+ msg = Printer()
+ with show_validation_error(hint_fill=False):
+ user_config = load_config(config_path)
+ if compare_to:
+ other_config = load_config(compare_to)
+ else:
+ # Recreate a default config based from user's config
+ lang = user_config["nlp"]["lang"]
+ pipeline = list(user_config["nlp"]["pipeline"])
+ msg.info(f"Found user-defined language: '{lang}'")
+ msg.info(f"Found user-defined pipelines: {pipeline}")
+ other_config = init_config(
+ lang=lang,
+ pipeline=pipeline,
+ optimize=optimize.value,
+ gpu=gpu,
+ pretraining=pretraining,
+ silent=True,
+ )
+
+ user = user_config.to_str()
+ other = other_config.to_str()
+
+ if user == other:
+ msg.warn("No diff to show: configs are identical")
+ else:
+ diff_text = diff_strings(other, user, add_symbols=markdown)
+ if markdown:
+ md = MarkdownRenderer()
+ md.add(md.code_block(diff_text, "diff"))
+ print(md.text)
+ else:
+ print(diff_text)
diff --git a/spacy/cli/package.py b/spacy/cli/package.py
index f9d2a9af2..b8c8397b6 100644
--- a/spacy/cli/package.py
+++ b/spacy/cli/package.py
@@ -7,6 +7,7 @@ from collections import defaultdict
from catalogue import RegistryError
import srsly
import sys
+import re
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
from ..schemas import validate, ModelMetaSchema
@@ -109,6 +110,24 @@ def package(
", ".join(meta["requirements"]),
)
if name is not None:
+ if not name.isidentifier():
+ msg.fail(
+ f"Model name ('{name}') is not a valid module name. "
+ "This is required so it can be imported as a module.",
+ "We recommend names that use ASCII A-Z, a-z, _ (underscore), "
+ "and 0-9. "
+ "For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
+ exits=1,
+ )
+ if not _is_permitted_package_name(name):
+ msg.fail(
+ f"Model name ('{name}') is not a permitted package name. "
+ "This is required to correctly load the model with spacy.load.",
+ "We recommend names that use ASCII A-Z, a-z, _ (underscore), "
+ "and 0-9. "
+ "For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
+ exits=1,
+ )
meta["name"] = name
if version is not None:
meta["version"] = version
@@ -162,7 +181,7 @@ def package(
imports="\n".join(f"from . import {m}" for m in imports)
)
create_file(package_path / "__init__.py", init_py)
- msg.good(f"Successfully created package '{model_name_v}'", main_path)
+ msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
if create_sdist:
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
@@ -171,8 +190,14 @@ def package(
if create_wheel:
with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
- wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}"
+ wheel_name_squashed = re.sub("_+", "_", model_name_v)
+ wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
msg.good(f"Successfully created binary wheel", wheel)
+ if "__" in model_name:
+ msg.warn(
+ f"Model name ('{model_name}') contains a run of underscores. "
+ "Runs of underscores are not significant in installed package names.",
+ )
def has_wheel() -> bool:
@@ -422,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
return md.text
+def _is_permitted_package_name(package_name: str) -> bool:
+ # regex from: https://www.python.org/dev/peps/pep-0426/#name
+ permitted_match = re.search(
+ r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
+ )
+ return permitted_match is not None
+
+
TEMPLATE_SETUP = """
#!/usr/bin/env python
import io
diff --git a/spacy/cli/templates/quickstart_training.jinja b/spacy/cli/templates/quickstart_training.jinja
index fb79a4f60..ae11dcafc 100644
--- a/spacy/cli/templates/quickstart_training.jinja
+++ b/spacy/cli/templates/quickstart_training.jinja
@@ -3,6 +3,7 @@ the docs and the init config command. It encodes various best practices and
can help generate the best possible configuration, given a user's requirements. #}
{%- set use_transformer = hardware != "cpu" -%}
{%- set transformer = transformer_data[optimize] if use_transformer else {} -%}
+{%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "spancat", "trainable_lemmatizer"] -%}
[paths]
train = null
dev = null
@@ -24,10 +25,10 @@ lang = "{{ lang }}"
{%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%}
{%- set with_accuracy = optimize == "accuracy" -%}
{%- set has_accurate_textcat = has_textcat and with_accuracy -%}
-{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or has_accurate_textcat) -%}
-{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
+{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "spancat" in components or "trainable_lemmatizer" in components or "entity_linker" in components or has_accurate_textcat) -%}
+{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components -%}
{%- else -%}
-{%- set full_pipeline = components %}
+{%- set full_pipeline = components -%}
{%- endif %}
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
batch_size = {{ 128 if hardware == "gpu" else 1000 }}
@@ -54,7 +55,7 @@ stride = 96
factory = "morphologizer"
[components.morphologizer.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.morphologizer.model.tok2vec]
@@ -70,7 +71,7 @@ grad_factor = 1.0
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
@@ -123,6 +124,60 @@ grad_factor = 1.0
@layers = "reduce_mean.v1"
{% endif -%}
+{% if "spancat" in components -%}
+[components.spancat]
+factory = "spancat"
+max_positive = null
+scorer = {"@scorers":"spacy.spancat_scorer.v1"}
+spans_key = "sc"
+threshold = 0.5
+
+[components.spancat.model]
+@architectures = "spacy.SpanCategorizer.v1"
+
+[components.spancat.model.reducer]
+@layers = "spacy.mean_max_reducer.v1"
+hidden_size = 128
+
+[components.spancat.model.scorer]
+@layers = "spacy.LinearLogistic.v1"
+nO = null
+nI = null
+
+[components.spancat.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+
+[components.spancat.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+
+[components.spancat.suggester]
+@misc = "spacy.ngram_suggester.v1"
+sizes = [1,2,3]
+{% endif -%}
+
+{% if "trainable_lemmatizer" in components -%}
+[components.trainable_lemmatizer]
+factory = "trainable_lemmatizer"
+backoff = "orth"
+min_tree_freq = 3
+overwrite = false
+scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
+top_k = 1
+
+[components.trainable_lemmatizer.model]
+@architectures = "spacy.Tagger.v2"
+nO = null
+normalize = false
+
+[components.trainable_lemmatizer.model.tok2vec]
+@architectures = "spacy-transformers.TransformerListener.v1"
+grad_factor = 1.0
+
+[components.trainable_lemmatizer.model.tok2vec.pooling]
+@layers = "reduce_mean.v1"
+{% endif -%}
+
{% if "entity_linker" in components -%}
[components.entity_linker]
factory = "entity_linker"
@@ -131,7 +186,7 @@ incl_context = true
incl_prior = true
[components.entity_linker.model]
-@architectures = "spacy.EntityLinker.v1"
+@architectures = "spacy.EntityLinker.v2"
nO = null
[components.entity_linker.model.tok2vec]
@@ -238,7 +293,7 @@ maxout_pieces = 3
factory = "morphologizer"
[components.morphologizer.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.morphologizer.model.tok2vec]
@@ -251,7 +306,7 @@ width = ${components.tok2vec.model.encode.width}
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
@@ -295,6 +350,54 @@ nO = null
width = ${components.tok2vec.model.encode.width}
{% endif %}
+{% if "spancat" in components %}
+[components.spancat]
+factory = "spancat"
+max_positive = null
+scorer = {"@scorers":"spacy.spancat_scorer.v1"}
+spans_key = "sc"
+threshold = 0.5
+
+[components.spancat.model]
+@architectures = "spacy.SpanCategorizer.v1"
+
+[components.spancat.model.reducer]
+@layers = "spacy.mean_max_reducer.v1"
+hidden_size = 128
+
+[components.spancat.model.scorer]
+@layers = "spacy.LinearLogistic.v1"
+nO = null
+nI = null
+
+[components.spancat.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+
+[components.spancat.suggester]
+@misc = "spacy.ngram_suggester.v1"
+sizes = [1,2,3]
+{% endif %}
+
+{% if "trainable_lemmatizer" in components -%}
+[components.trainable_lemmatizer]
+factory = "trainable_lemmatizer"
+backoff = "orth"
+min_tree_freq = 3
+overwrite = false
+scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
+top_k = 1
+
+[components.trainable_lemmatizer.model]
+@architectures = "spacy.Tagger.v2"
+nO = null
+normalize = false
+
+[components.trainable_lemmatizer.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+{% endif -%}
+
{% if "entity_linker" in components -%}
[components.entity_linker]
factory = "entity_linker"
@@ -303,7 +406,7 @@ incl_context = true
incl_prior = true
[components.entity_linker.model]
-@architectures = "spacy.EntityLinker.v1"
+@architectures = "spacy.EntityLinker.v2"
nO = null
[components.entity_linker.model.tok2vec]
@@ -369,7 +472,7 @@ no_output_layer = false
{% endif %}
{% for pipe in components %}
-{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
+{% if pipe not in listener_components %}
{# Other components defined by the user: we just assume they're factories #}
[components.{{ pipe }}]
factory = "{{ pipe }}"
diff --git a/spacy/displacy/__init__.py b/spacy/displacy/__init__.py
index 25d530c83..5d49b6eb7 100644
--- a/spacy/displacy/__init__.py
+++ b/spacy/displacy/__init__.py
@@ -7,7 +7,7 @@ USAGE: https://spacy.io/usage/visualizers
from typing import Union, Iterable, Optional, Dict, Any, Callable
import warnings
-from .render import DependencyRenderer, EntityRenderer
+from .render import DependencyRenderer, EntityRenderer, SpanRenderer
from ..tokens import Doc, Span
from ..errors import Errors, Warnings
from ..util import is_in_jupyter
@@ -44,6 +44,7 @@ def render(
factories = {
"dep": (DependencyRenderer, parse_deps),
"ent": (EntityRenderer, parse_ents),
+ "span": (SpanRenderer, parse_spans),
}
if style not in factories:
raise ValueError(Errors.E087.format(style=style))
@@ -55,6 +56,10 @@ def render(
renderer_func, converter = factories[style]
renderer = renderer_func(options=options)
parsed = [converter(doc, options) for doc in docs] if not manual else docs # type: ignore
+ if manual:
+ for doc in docs:
+ if isinstance(doc, dict) and "ents" in doc:
+ doc["ents"] = sorted(doc["ents"], key=lambda x: (x["start"], x["end"]))
_html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() # type: ignore
html = _html["parsed"]
if RENDER_WRAPPER is not None:
@@ -203,6 +208,42 @@ def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
+def parse_spans(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
+ """Generate spans in [{start: i, end: i, label: 'label'}] format.
+
+ doc (Doc): Document to parse.
+ options (Dict[str, any]): Span-specific visualisation options.
+ RETURNS (dict): Generated span types keyed by text (original text) and spans.
+ """
+ kb_url_template = options.get("kb_url_template", None)
+ spans_key = options.get("spans_key", "sc")
+ spans = [
+ {
+ "start": span.start_char,
+ "end": span.end_char,
+ "start_token": span.start,
+ "end_token": span.end,
+ "label": span.label_,
+ "kb_id": span.kb_id_ if span.kb_id_ else "",
+ "kb_url": kb_url_template.format(span.kb_id_) if kb_url_template else "#",
+ }
+ for span in doc.spans[spans_key]
+ ]
+ tokens = [token.text for token in doc]
+
+ if not spans:
+ warnings.warn(Warnings.W117.format(spans_key=spans_key))
+ title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
+ settings = get_doc_settings(doc)
+ return {
+ "text": doc.text,
+ "spans": spans,
+ "title": title,
+ "settings": settings,
+ "tokens": tokens,
+ }
+
+
def set_render_wrapper(func: Callable[[str], str]) -> None:
"""Set an optional wrapper function that is called around the generated
HTML markup on displacy.render. This can be used to allow integration into
diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py
index a032d843b..247ad996b 100644
--- a/spacy/displacy/render.py
+++ b/spacy/displacy/render.py
@@ -1,12 +1,15 @@
-from typing import Dict, Any, List, Optional, Union
+from typing import Any, Dict, List, Optional, Tuple, Union
import uuid
+import itertools
-from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS
-from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
-from .templates import TPL_ENTS, TPL_KB_LINK
-from ..util import minify_html, escape_html, registry
from ..errors import Errors
-
+from ..util import escape_html, minify_html, registry
+from .templates import TPL_DEP_ARCS, TPL_DEP_SVG, TPL_DEP_WORDS
+from .templates import TPL_DEP_WORDS_LEMMA, TPL_ENT, TPL_ENT_RTL, TPL_ENTS
+from .templates import TPL_FIGURE, TPL_KB_LINK, TPL_PAGE, TPL_SPAN
+from .templates import TPL_SPAN_RTL, TPL_SPAN_SLICE, TPL_SPAN_SLICE_RTL
+from .templates import TPL_SPAN_START, TPL_SPAN_START_RTL, TPL_SPANS
+from .templates import TPL_TITLE
DEFAULT_LANG = "en"
DEFAULT_DIR = "ltr"
@@ -33,6 +36,168 @@ DEFAULT_LABEL_COLORS = {
}
+class SpanRenderer:
+ """Render Spans as SVGs."""
+
+ style = "span"
+
+ def __init__(self, options: Dict[str, Any] = {}) -> None:
+ """Initialise span renderer
+
+ options (dict): Visualiser-specific options (colors, spans)
+ """
+ # Set up the colors and overall look
+ colors = dict(DEFAULT_LABEL_COLORS)
+ user_colors = registry.displacy_colors.get_all()
+ for user_color in user_colors.values():
+ if callable(user_color):
+ # Since this comes from the function registry, we want to make
+ # sure we support functions that *return* a dict of colors
+ user_color = user_color()
+ if not isinstance(user_color, dict):
+ raise ValueError(Errors.E925.format(obj=type(user_color)))
+ colors.update(user_color)
+ colors.update(options.get("colors", {}))
+ self.default_color = DEFAULT_ENTITY_COLOR
+ self.colors = {label.upper(): color for label, color in colors.items()}
+
+ # Set up how the text and labels will be rendered
+ self.direction = DEFAULT_DIR
+ self.lang = DEFAULT_LANG
+ self.top_offset = options.get("top_offset", 40)
+ self.top_offset_step = options.get("top_offset_step", 17)
+
+ # Set up which templates will be used
+ template = options.get("template")
+ if template:
+ self.span_template = template["span"]
+ self.span_slice_template = template["slice"]
+ self.span_start_template = template["start"]
+ else:
+ if self.direction == "rtl":
+ self.span_template = TPL_SPAN_RTL
+ self.span_slice_template = TPL_SPAN_SLICE_RTL
+ self.span_start_template = TPL_SPAN_START_RTL
+ else:
+ self.span_template = TPL_SPAN
+ self.span_slice_template = TPL_SPAN_SLICE
+ self.span_start_template = TPL_SPAN_START
+
+ def render(
+ self, parsed: List[Dict[str, Any]], page: bool = False, minify: bool = False
+ ) -> str:
+ """Render complete markup.
+
+ parsed (list): Dependency parses to render.
+ page (bool): Render parses wrapped as full HTML page.
+ minify (bool): Minify HTML markup.
+ RETURNS (str): Rendered HTML markup.
+ """
+ rendered = []
+ for i, p in enumerate(parsed):
+ if i == 0:
+ settings = p.get("settings", {})
+ self.direction = settings.get("direction", DEFAULT_DIR)
+ self.lang = settings.get("lang", DEFAULT_LANG)
+ rendered.append(self.render_spans(p["tokens"], p["spans"], p.get("title")))
+
+ if page:
+ docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
+ markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
+ else:
+ markup = "".join(rendered)
+ if minify:
+ return minify_html(markup)
+ return markup
+
+ def render_spans(
+ self,
+ tokens: List[str],
+ spans: List[Dict[str, Any]],
+ title: Optional[str],
+ ) -> str:
+ """Render span types in text.
+
+ Spans are rendered per-token; this means that for each token, we check if it's part
+ of a span slice (a member of a span type) or a span start (the starting token of a
+ given span type).
+
+ tokens (list): Individual tokens in the text
+ spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
+ title (str / None): Document title set in Doc.user_data['title'].
+ """
+ per_token_info = []
+ for idx, token in enumerate(tokens):
+ # Identify if a token belongs to a Span (and which) and if it's a
+ # start token of said Span. We'll use this for the final HTML render
+ token_markup: Dict[str, Any] = {}
+ token_markup["text"] = token
+ entities = []
+ for span in spans:
+ ent = {}
+ if span["start_token"] <= idx < span["end_token"]:
+ ent["label"] = span["label"]
+ ent["is_start"] = True if idx == span["start_token"] else False
+ kb_id = span.get("kb_id", "")
+ kb_url = span.get("kb_url", "#")
+ ent["kb_link"] = (
+ TPL_KB_LINK.format(kb_id=kb_id, kb_url=kb_url) if kb_id else ""
+ )
+ entities.append(ent)
+ token_markup["entities"] = entities
+ per_token_info.append(token_markup)
+
+ markup = self._render_markup(per_token_info)
+ markup = TPL_SPANS.format(content=markup, dir=self.direction)
+ if title:
+ markup = TPL_TITLE.format(title=title) + markup
+ return markup
+
+ def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
+ """Render the markup from per-token information"""
+ markup = ""
+ for token in per_token_info:
+ entities = sorted(token["entities"], key=lambda d: d["label"])
+ if entities:
+ slices = self._get_span_slices(entities)
+ starts = self._get_span_starts(entities)
+ markup += self.span_template.format(
+ text=token["text"], span_slices=slices, span_starts=starts
+ )
+ else:
+ markup += escape_html(token["text"] + " ")
+ return markup
+
+ def _get_span_slices(self, entities: List[Dict]) -> str:
+ """Get the rendered markup of all Span slices"""
+ span_slices = []
+ for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
+ color = self.colors.get(entity["label"].upper(), self.default_color)
+ span_slice = self.span_slice_template.format(
+ bg=color, top_offset=self.top_offset + step
+ )
+ span_slices.append(span_slice)
+ return "".join(span_slices)
+
+ def _get_span_starts(self, entities: List[Dict]) -> str:
+ """Get the rendered markup of all Span start tokens"""
+ span_starts = []
+ for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
+ color = self.colors.get(entity["label"].upper(), self.default_color)
+ span_start = (
+ self.span_start_template.format(
+ bg=color,
+ top_offset=self.top_offset + step,
+ label=entity["label"],
+ kb_link=entity["kb_link"],
+ )
+ if entity["is_start"]
+ else ""
+ )
+ span_starts.append(span_start)
+ return "".join(span_starts)
+
+
class DependencyRenderer:
"""Render dependency parses as SVGs."""
@@ -105,7 +270,7 @@ class DependencyRenderer:
RETURNS (str): Rendered SVG markup.
"""
self.levels = self.get_levels(arcs)
- self.highest_level = len(self.levels)
+ self.highest_level = max(self.levels.values(), default=0)
self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke
self.width = self.offset_x + len(words) * self.distance
self.height = self.offset_y + 3 * self.word_spacing
@@ -165,7 +330,7 @@ class DependencyRenderer:
if start < 0 or end < 0:
error_args = dict(start=start, end=end, label=label, dir=direction)
raise ValueError(Errors.E157.format(**error_args))
- level = self.levels.index(end - start) + 1
+ level = self.levels[(start, end, label)]
x_start = self.offset_x + start * self.distance + self.arrow_spacing
if self.direction == "rtl":
x_start = self.width - x_start
@@ -181,7 +346,7 @@ class DependencyRenderer:
y_curve = self.offset_y - level * self.distance / 2
if self.compact:
y_curve = self.offset_y - level * self.distance / 6
- if y_curve == 0 and len(self.levels) > 5:
+ if y_curve == 0 and max(self.levels.values(), default=0) > 5:
y_curve = -self.distance
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
arc = self.get_arc(x_start, y, y_curve, x_end)
@@ -225,15 +390,23 @@ class DependencyRenderer:
p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2)
return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}"
- def get_levels(self, arcs: List[Dict[str, Any]]) -> List[int]:
+ def get_levels(self, arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
"""Calculate available arc height "levels".
Used to calculate arrow heights dynamically and without wasting space.
args (list): Individual arcs and their start, end, direction and label.
- RETURNS (list): Arc levels sorted from lowest to highest.
+ RETURNS (dict): Arc levels keyed by (start, end, label).
"""
- levels = set(map(lambda arc: arc["end"] - arc["start"], arcs))
- return sorted(list(levels))
+ arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
+ length = max([arc["end"] for arc in arcs], default=0)
+ max_level = [0] * length
+ levels = {}
+ for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
+ level = max(max_level[arc["start"] : arc["end"]]) + 1
+ for i in range(arc["start"], arc["end"]):
+ max_level[i] = level
+ levels[(arc["start"], arc["end"], arc["label"])] = level
+ return levels
class EntityRenderer:
@@ -242,7 +415,7 @@ class EntityRenderer:
style = "ent"
def __init__(self, options: Dict[str, Any] = {}) -> None:
- """Initialise dependency renderer.
+ """Initialise entity renderer.
options (dict): Visualiser-specific options (colors, ents)
"""
diff --git a/spacy/displacy/templates.py b/spacy/displacy/templates.py
index e7d3d4266..ff81e7a1d 100644
--- a/spacy/displacy/templates.py
+++ b/spacy/displacy/templates.py
@@ -62,6 +62,55 @@ TPL_ENT_RTL = """
"""
+TPL_SPANS = """
+
{content}
+"""
+
+TPL_SPAN = """
+
+ {text}
+ {span_slices}
+ {span_starts}
+
+"""
+
+TPL_SPAN_SLICE = """
+
+
+"""
+
+
+TPL_SPAN_START = """
+
+
+ {label}{kb_link}
+
+
+
+"""
+
+TPL_SPAN_RTL = """
+
+ {text}
+ {span_slices}
+ {span_starts}
+
+"""
+
+TPL_SPAN_SLICE_RTL = """
+
+
+"""
+
+TPL_SPAN_START_RTL = """
+
+
+ {label}{kb_link}
+
+
+"""
+
+
# Important: this needs to start with a space!
TPL_KB_LINK = """
{kb_id}
diff --git a/spacy/errors.py b/spacy/errors.py
index 390612123..b01afcb80 100644
--- a/spacy/errors.py
+++ b/spacy/errors.py
@@ -192,6 +192,13 @@ class Warnings(metaclass=ErrorsWithCodes):
W115 = ("Skipping {method}: the floret vector table cannot be modified. "
"Vectors are calculated from character ngrams.")
W116 = ("Unable to clean attribute '{attr}'.")
+ W117 = ("No spans to visualize found in Doc object with spans_key: '{spans_key}'. If this is "
+ "surprising to you, make sure the Doc was processed using a model "
+ "that supports span categorization, and check the `doc.spans[spans_key]` "
+ "property manually if necessary.")
+ W118 = ("Term '{term}' not found in glossary. It may however be explained in documentation "
+ "for the corpora used to train the language. Please check "
+ "`nlp.meta[\"sources\"]` for any relevant links.")
class Errors(metaclass=ErrorsWithCodes):
@@ -483,7 +490,7 @@ class Errors(metaclass=ErrorsWithCodes):
"components, since spans are only views of the Doc. Use Doc and "
"Token attributes (or custom extension attributes) only and remove "
"the following: {attrs}")
- E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
+ E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
"Only Doc and Token attributes are supported.")
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
"to define the attribute? For example: `{attr}.???`")
@@ -520,10 +527,14 @@ class Errors(metaclass=ErrorsWithCodes):
E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x
+ E855 = ("Invalid {obj}: {obj} is not from the same doc.")
+ E856 = ("Error accessing span at position {i}: out of bounds in span group "
+ "of length {length}.")
+ E857 = ("Entry '{name}' not found in edit tree lemmatizer labels.")
E858 = ("The {mode} vector table does not support this operation. "
"{alternative}")
E859 = ("The floret vector table cannot be modified.")
- E860 = ("Can't truncate fasttext-bloom vectors.")
+ E860 = ("Can't truncate floret vectors.")
E861 = ("No 'keys' should be provided when initializing floret vectors "
"with 'minn' and 'maxn'.")
E862 = ("'hash_count' must be between 1-4 for floret vectors.")
@@ -566,9 +577,6 @@ class Errors(metaclass=ErrorsWithCodes):
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
"a list of spans, with each span represented by a tuple (start_char, end_char). "
"The tuple can be optionally extended with a label and a KB ID.")
- E880 = ("The 'wandb' library could not be found - did you install it? "
- "Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
- "config section, instead of the 'WandbLogger'.")
E884 = ("The pipeline could not be initialized because the vectors "
"could not be found at '{vectors}'. If your pipeline was already "
"initialized/trained before, call 'resume_training' instead of 'initialize', "
@@ -894,7 +902,18 @@ class Errors(metaclass=ErrorsWithCodes):
"patterns.")
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
"supported values are: 'I', 'O', 'B' and ''")
-
+ E1026 = ("Edit tree has an invalid format:\n{errors}")
+ E1027 = ("AlignmentArray only supports slicing with a step of 1.")
+ E1028 = ("AlignmentArray only supports indexing using an int or a slice.")
+ E1029 = ("Edit tree cannot be applied to form.")
+ E1030 = ("Edit tree identifier out of range.")
+ E1031 = ("Could not find gold transition - see logs above.")
+ E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
+ E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
+ E1034 = ("Node index {i} out of bounds ({length})")
+ E1035 = ("Token index {i} out of bounds ({length})")
+ E1036 = ("Cannot index into NoneNode")
+
# Deprecated model shortcuts, only used in errors and warnings
OLD_MODEL_SHORTCUTS = {
diff --git a/spacy/glossary.py b/spacy/glossary.py
index e45704fc5..25c00d3ed 100644
--- a/spacy/glossary.py
+++ b/spacy/glossary.py
@@ -1,3 +1,7 @@
+import warnings
+from .errors import Warnings
+
+
def explain(term):
"""Get a description for a given POS tag, dependency label or entity type.
@@ -11,6 +15,8 @@ def explain(term):
"""
if term in GLOSSARY:
return GLOSSARY[term]
+ else:
+ warnings.warn(Warnings.W118.format(term=term))
GLOSSARY = {
@@ -310,7 +316,6 @@ GLOSSARY = {
"re": "repeated element",
"rs": "reported speech",
"sb": "subject",
- "sb": "subject",
"sbp": "passivized subject (PP)",
"sp": "subject or predicate",
"svp": "separable verb prefix",
diff --git a/spacy/lang/dsb/__init__.py b/spacy/lang/dsb/__init__.py
new file mode 100644
index 000000000..c66092a0c
--- /dev/null
+++ b/spacy/lang/dsb/__init__.py
@@ -0,0 +1,16 @@
+from .lex_attrs import LEX_ATTRS
+from .stop_words import STOP_WORDS
+from ...language import Language, BaseDefaults
+
+
+class LowerSorbianDefaults(BaseDefaults):
+ lex_attr_getters = LEX_ATTRS
+ stop_words = STOP_WORDS
+
+
+class LowerSorbian(Language):
+ lang = "dsb"
+ Defaults = LowerSorbianDefaults
+
+
+__all__ = ["LowerSorbian"]
diff --git a/spacy/lang/dsb/examples.py b/spacy/lang/dsb/examples.py
new file mode 100644
index 000000000..6e9143826
--- /dev/null
+++ b/spacy/lang/dsb/examples.py
@@ -0,0 +1,15 @@
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.dsb.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+ "Z tym stwori so wumฤnjenje a zakลad za dalลกe wobdลบฤลanje pลez analyzu tekstoweje struktury a semantisku anotaciju a z tym teลพ za tu pลedstajenu digitalnu online-wersiju.",
+ "Mi so tu jara derje spodoba.",
+ "Kotre nowniny chceฤe mฤฤ?",
+ "Tak ako w slฤdnem lฤลe jo teke lฤtosa jano doma zapustowaล mรณลพno.",
+ "Zwรณstanjo pรณtakem hyลกฤi wjele ลบฤลa.",
+]
diff --git a/spacy/lang/dsb/lex_attrs.py b/spacy/lang/dsb/lex_attrs.py
new file mode 100644
index 000000000..367b3afb8
--- /dev/null
+++ b/spacy/lang/dsb/lex_attrs.py
@@ -0,0 +1,113 @@
+from ...attrs import LIKE_NUM
+
+_num_words = [
+ "nul",
+ "jaden",
+ "jadna",
+ "jadno",
+ "dwa",
+ "dwฤ",
+ "tลi",
+ "tลo",
+ "styri",
+ "styrjo",
+ "pฤล",
+ "pฤลo",
+ "ลกesฤ",
+ "ลกesฤo",
+ "sedym",
+ "sedymjo",
+ "wรณsym",
+ "wรณsymjo",
+ "ลบewjeล",
+ "ลบewjeลo",
+ "ลบaseล",
+ "ลบaseลo",
+ "jadnassฤo",
+ "dwanassฤo",
+ "tลinasฤo",
+ "styrnasฤo",
+ "pฤลnasฤo",
+ "ลกesnasฤo",
+ "sedymnasฤo",
+ "wรณsymnasฤo",
+ "ลบewjeลnasฤo",
+ "dwanasฤo",
+ "dwaลบasฤa",
+ "tลiลบasฤa",
+ "styrลบasฤa",
+ "pฤลลบaset",
+ "ลกesฤลบaset",
+ "sedymลบaset",
+ "wรณsymลบaset",
+ "ลบewjeลลบaset",
+ "sto",
+ "tysac",
+ "milion",
+ "miliarda",
+ "bilion",
+ "biliarda",
+ "trilion",
+ "triliarda",
+]
+
+_ordinal_words = [
+ "prฤdny",
+ "prฤdna",
+ "prฤdne",
+ "drugi",
+ "druga",
+ "druge",
+ "tลeลi",
+ "tลeลa",
+ "tลeลe",
+ "stwรณrty",
+ "stwรณrta",
+ "stwรณrte",
+ "pรชty",
+ "pฤta",
+ "pรชte",
+ "ลกesty",
+ "ลกesta",
+ "ลกeste",
+ "sedymy",
+ "sedyma",
+ "sedyme",
+ "wรณsymy",
+ "wรณsyma",
+ "wรณsyme",
+ "ลบewjety",
+ "ลบewjeta",
+ "ลบewjete",
+ "ลบasety",
+ "ลบaseta",
+ "ลบasete",
+ "jadnasty",
+ "jadnasta",
+ "jadnaste",
+ "dwanasty",
+ "dwanasta",
+ "dwanaste",
+]
+
+
+def like_num(text):
+ if text.startswith(("+", "-", "ยฑ", "~")):
+ text = text[1:]
+ text = text.replace(",", "").replace(".", "")
+ if text.isdigit():
+ return True
+ if text.count("/") == 1:
+ num, denom = text.split("/")
+ if num.isdigit() and denom.isdigit():
+ return True
+ text_lower = text.lower()
+ if text_lower in _num_words:
+ return True
+ # Check ordinal number
+ if text_lower in _ordinal_words:
+ return True
+ return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
diff --git a/spacy/lang/dsb/stop_words.py b/spacy/lang/dsb/stop_words.py
new file mode 100644
index 000000000..376e04aa6
--- /dev/null
+++ b/spacy/lang/dsb/stop_words.py
@@ -0,0 +1,15 @@
+STOP_WORDS = set(
+ """
+a abo aby ako ale aลพ
+
+daniลพ dokulaลพ
+
+gaลพ
+
+jolic
+
+pak pรณtom
+
+teke togodla
+""".split()
+)
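
With the three Lower Sorbian files above in place, the language can be loaded as a blank pipeline. A minimal sketch, assuming this patch is installed; the sentence is a made-up example:

    import spacy

    nlp = spacy.blank("dsb")  # resolves to the new LowerSorbian class
    doc = nlp("Mam tśi knigły a jadno pismo.")
    print([(t.text, t.like_num, t.is_stop) for t in doc])
    # "tśi" and "jadno" are in _num_words, so like_num is True for them;
    # "a" is in STOP_WORDS, so is_stop is True there.
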
diff --git a/spacy/lang/en/tokenizer_exceptions.py b/spacy/lang/en/tokenizer_exceptions.py
index 55b544e42..2c20b8c27 100644
--- a/spacy/lang/en/tokenizer_exceptions.py
+++ b/spacy/lang/en/tokenizer_exceptions.py
@@ -447,7 +447,6 @@ for exc_data in [
{ORTH: "La.", NORM: "Louisiana"},
{ORTH: "Mar.", NORM: "March"},
{ORTH: "Mass.", NORM: "Massachusetts"},
- {ORTH: "May.", NORM: "May"},
{ORTH: "Mich.", NORM: "Michigan"},
{ORTH: "Minn.", NORM: "Minnesota"},
{ORTH: "Miss.", NORM: "Mississippi"},
diff --git a/spacy/lang/es/examples.py b/spacy/lang/es/examples.py
index 2bcbd8740..e4dfbcb6d 100644
--- a/spacy/lang/es/examples.py
+++ b/spacy/lang/es/examples.py
@@ -9,14 +9,14 @@ Example sentences to test spaCy and its language models.
sentences = [
"Apple estรก buscando comprar una startup del Reino Unido por mil millones de dรณlares.",
"Los coches autรณnomos delegan la responsabilidad del seguro en sus fabricantes.",
- "San Francisco analiza prohibir los robots delivery.",
+ "San Francisco analiza prohibir los robots de reparto.",
"Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.",
"Veo al hombre con el telescopio.",
"La araรฑa come moscas.",
"El pingรผino incuba en su nido sobre el hielo.",
- "ยฟDรณnde estais?",
- "ยฟQuiรฉn es el presidente Francรฉs?",
- "ยฟDรณnde estรก encuentra la capital de Argentina?",
+ "ยฟDรณnde estรกis?",
+ "ยฟQuiรฉn es el presidente francรฉs?",
+ "ยฟDรณnde se encuentra la capital de Argentina?",
"ยฟCuรกndo naciรณ Josรฉ de San Martรญn?",
]
diff --git a/spacy/lang/es/stop_words.py b/spacy/lang/es/stop_words.py
index 004df4fca..6d2885481 100644
--- a/spacy/lang/es/stop_words.py
+++ b/spacy/lang/es/stop_words.py
@@ -1,82 +1,80 @@
STOP_WORDS = set(
"""
-actualmente acuerdo adelante ademas ademรกs adrede afirmรณ agregรณ ahi ahora ahรญ
-al algo alguna algunas alguno algunos algรบn alli allรญ alrededor ambos ampleamos
-antano antaรฑo ante anterior antes apenas aproximadamente aquel aquella aquellas
-aquello aquellos aqui aquรฉl aquรฉlla aquรฉllas aquรฉllos aquรญ arriba arribaabajo
-asegurรณ asi asรญ atras aun aunque ayer aรฑadiรณ aรบn
+a acuerdo adelante ademas ademรกs afirmรณ agregรณ ahi ahora ahรญ al algo alguna
+algunas alguno algunos algรบn alli allรญ alrededor ambos ante anterior antes
+apenas aproximadamente aquel aquella aquellas aquello aquellos aqui aquรฉl
+aquรฉlla aquรฉllas aquรฉllos aquรญ arriba asegurรณ asi asรญ atras aun aunque aรฑadiรณ
+aรบn
bajo bastante bien breve buen buena buenas bueno buenos
-cada casi cerca cierta ciertas cierto ciertos cinco claro comentรณ como con
-conmigo conocer conseguimos conseguir considera considerรณ consigo consigue
-consiguen consigues contigo contra cosas creo cual cuales cualquier cuando
-cuanta cuantas cuanto cuantos cuatro cuenta cuรกl cuรกles cuรกndo cuรกnta cuรกntas
-cuรกnto cuรกntos cรณmo
+cada casi cierta ciertas cierto ciertos cinco claro comentรณ como con conmigo
+conocer conseguimos conseguir considera considerรณ consigo consigue consiguen
+consigues contigo contra creo cual cuales cualquier cuando cuanta cuantas
+cuanto cuantos cuatro cuenta cuรกl cuรกles cuรกndo cuรกnta cuรกntas cuรกnto cuรกntos
+cรณmo
da dado dan dar de debajo debe deben debido decir dejรณ del delante demasiado
demรกs dentro deprisa desde despacio despues despuรฉs detras detrรกs dia dias dice
-dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante dรญa
-dรญas dรณnde
+dicen dicho dieron diez diferente diferentes dijeron dijo dio doce donde dos
+durante dรญa dรญas dรณnde
-ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas
-empleo en encima encuentra enfrente enseguida entonces entre era eramos eran
-eras eres es esa esas ese eso esos esta estaba estaban estado estados estais
-estamos estan estar estarรก estas este esto estos estoy estuvo estรก estรกn ex
-excepto existe existen explicรณ expresรณ รฉl รฉsa รฉsas รฉse รฉsos รฉsta รฉstas รฉste
-รฉstos
+e el ella ellas ello ellos embargo en encima encuentra enfrente enseguida
+entonces entre era eramos eran eras eres es esa esas ese eso esos esta estaba
+estaban estado estados estais estamos estan estar estarรก estas este esto estos
+estoy estuvo estรก estรกn excepto existe existen explicรณ expresรณ รฉl รฉsa รฉsas รฉse
+รฉsos รฉsta รฉstas รฉste รฉstos
fin final fue fuera fueron fui fuimos
-general gran grandes gueno
+gran grande grandes
ha haber habia habla hablan habrรก habรญa habรญan hace haceis hacemos hacen hacer
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
-hizo horas hoy hubo
+hizo hoy hubo
-igual incluso indicรณ informo informรณ intenta intentais intentamos intentan
-intentar intentas intento ir
+igual incluso indicรณ informo informรณ ir
junto
-la lado largo las le lejos les llegรณ lleva llevar lo los luego lugar
+la lado largo las le les llegรณ lleva llevar lo los luego
mal manera manifestรณ mas mayor me mediante medio mejor mencionรณ menos menudo mi
-mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha
-muchas mucho muchos muy mรกs mรญ mรญa mรญas mรญo mรญos
+mia mias mientras mio mios mis misma mismas mismo mismos modo mucha muchas
+mucho muchos muy mรกs mรญ mรญa mรญas mรญo mรญos
nada nadie ni ninguna ningunas ninguno ningunos ningรบn no nos nosotras nosotros
-nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca
+nuestra nuestras nuestro nuestros nueva nuevas nueve nuevo nuevos nunca
-ocho os otra otras otro otros
+o ocho once os otra otras otro otros
-pais para parece parte partir pasada pasado paรฌs peor pero pesar poca pocas
-poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrรก
+para parece parte partir pasada pasado paรฌs peor pero pesar poca pocas poco
+pocos podeis podemos poder podria podriais podriamos podrian podrias podrรก
podrรกn podrรญa podrรญan poner por porque posible primer primera primero primeros
-principalmente pronto propia propias propio propios proximo prรณximo prรณximos
-pudo pueda puede pueden puedo pues
+pronto propia propias propio propios proximo prรณximo prรณximos pudo pueda puede
+pueden puedo pues
-qeu que quedรณ queremos quien quienes quiere quiza quizas quizรก quizรกs quiรฉn quiรฉnes quรฉ
+qeu que quedรณ queremos quien quienes quiere quiza quizas quizรก quizรกs quiรฉn
+quiรฉnes quรฉ
-raras realizado realizar realizรณ repente respecto
+realizado realizar realizรณ repente respecto
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
segรบn seis ser sera serรก serรกn serรญa seรฑalรณ si sido siempre siendo siete sigue
-siguiente sin sino sobre sois sola solamente solas solo solos somos son soy
-soyos su supuesto sus suya suyas suyo sรฉ sรญ sรณlo
+siguiente sin sino sobre sois sola solamente solas solo solos somos son soy su
+supuesto sus suya suyas suyo suyos sรฉ sรญ sรณlo
tal tambien tambiรฉn tampoco tan tanto tarde te temprano tendrรก tendrรกn teneis
-tenemos tener tenga tengo tenido tenรญa tercera ti tiempo tiene tienen toda
-todas todavia todavรญa todo todos total trabaja trabajais trabajamos trabajan
-trabajar trabajas trabajo tras trata travรฉs tres tu tus tuvo tuya tuyas tuyo
-tuyos tรบ
+tenemos tener tenga tengo tenido tenรญa tercera tercero ti tiene tienen toda
+todas todavia todavรญa todo todos total tras trata travรฉs tres tu tus tuvo tuya
+tuyas tuyo tuyos tรบ
-ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
+u ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
รบltima รบltimas รบltimo รบltimos
-va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero
-vez vosotras vosotros voy vuestra vuestras vuestro vuestros
+va vais vamos van varias varios vaya veces ver verdad verdadera verdadero vez
+vosotras vosotros voy vuestra vuestras vuestro vuestros
-ya yo
+y ya yo
""".split()
)
diff --git a/spacy/lang/fr/lex_attrs.py b/spacy/lang/fr/lex_attrs.py
index da98c6e37..811312ad7 100644
--- a/spacy/lang/fr/lex_attrs.py
+++ b/spacy/lang/fr/lex_attrs.py
@@ -3,7 +3,7 @@ from ...attrs import LIKE_NUM
_num_words = set(
"""
-zero un deux trois quatre cinq six sept huit neuf dix
+zero un une deux trois quatre cinq six sept huit neuf dix
onze douze treize quatorze quinze seize dix-sept dix-huit dix-neuf
vingt trente quarante cinquante soixante soixante-dix septante quatre-vingt huitante quatre-vingt-dix nonante
cent mille mil million milliard billion quadrillion quintillion
@@ -13,7 +13,7 @@ sextillion septillion octillion nonillion decillion
_ordinal_words = set(
"""
-premier deuxiรจme second troisiรจme quatriรจme cinquiรจme sixiรจme septiรจme huitiรจme neuviรจme dixiรจme
+premier premiรจre deuxiรจme second seconde troisiรจme quatriรจme cinquiรจme sixiรจme septiรจme huitiรจme neuviรจme dixiรจme
onziรจme douziรจme treiziรจme quatorziรจme quinziรจme seiziรจme dix-septiรจme dix-huitiรจme dix-neuviรจme
vingtiรจme trentiรจme quarantiรจme cinquantiรจme soixantiรจme soixante-dixiรจme septantiรจme quatre-vingtiรจme huitantiรจme quatre-vingt-dixiรจme nonantiรจme
centiรจme milliรจme millionniรจme milliardiรจme billionniรจme quadrillionniรจme quintillionniรจme
diff --git a/spacy/lang/fr/syntax_iterators.py b/spacy/lang/fr/syntax_iterators.py
index 5f7ba5c10..5849c40b3 100644
--- a/spacy/lang/fr/syntax_iterators.py
+++ b/spacy/lang/fr/syntax_iterators.py
@@ -64,9 +64,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
prev_end = right_end.i
left_index = word.left_edge.i
- left_index = (
- left_index + 1 if word.left_edge.pos == adp_pos else left_index
- )
+ left_index = left_index + 1 if word.left_edge.pos == adp_pos else left_index
yield left_index, right_end.i + 1, np_label
elif word.dep == conj_label:
diff --git a/spacy/lang/hsb/__init__.py b/spacy/lang/hsb/__init__.py
new file mode 100644
index 000000000..034d82319
--- /dev/null
+++ b/spacy/lang/hsb/__init__.py
@@ -0,0 +1,18 @@
+from .lex_attrs import LEX_ATTRS
+from .stop_words import STOP_WORDS
+from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
+from ...language import Language, BaseDefaults
+
+
+class UpperSorbianDefaults(BaseDefaults):
+ lex_attr_getters = LEX_ATTRS
+ stop_words = STOP_WORDS
+ tokenizer_exceptions = TOKENIZER_EXCEPTIONS
+
+
+class UpperSorbian(Language):
+ lang = "hsb"
+ Defaults = UpperSorbianDefaults
+
+
+__all__ = ["UpperSorbian"]
diff --git a/spacy/lang/hsb/examples.py b/spacy/lang/hsb/examples.py
new file mode 100644
index 000000000..21f6f7584
--- /dev/null
+++ b/spacy/lang/hsb/examples.py
@@ -0,0 +1,15 @@
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.hsb.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+ "To bฤลกo wjelgin raลบone a jo se wรณt luลบi derje pลiwzeลo. Tak som doลพywiลa wjelgin",
+ "Jogo pลewรณลบowarce stej groniลej, aลพ how w serbskich stronach njama Santa Claus nic pytaล.",
+ "A ten sobuลบฤลaลeล Statneje biblioteki w Barlinju jo pลimjeล drogotne knigลy bลบez rukajcowu z nagima rukoma!",
+ "Take wobchadanje z naลกym kulturnym derbstwom zewลกym njejลบo.",
+ "Wopลimjeลe drugich pลinoskow jo byลo na wusokem niwowje, ako pลecej.",
+]
diff --git a/spacy/lang/hsb/lex_attrs.py b/spacy/lang/hsb/lex_attrs.py
new file mode 100644
index 000000000..5f300a73d
--- /dev/null
+++ b/spacy/lang/hsb/lex_attrs.py
@@ -0,0 +1,106 @@
+from ...attrs import LIKE_NUM
+
+_num_words = [
+ "nul",
+ "jedyn",
+ "jedna",
+ "jedne",
+ "dwaj",
+ "dwฤ",
+ "tลi",
+ "tลo",
+ "ลกtyri",
+ "ลกtyrjo",
+ "pjeฤ",
+ "ลกฤsฤ",
+ "sydom",
+ "wosom",
+ "dลบewjeฤ",
+ "dลบesaฤ",
+ "jฤdnaฤe",
+ "dwanaฤe",
+ "tลinaฤe",
+ "ลกtyrnaฤe",
+ "pjatnaฤe",
+ "ลกฤsnaฤe",
+ "sydomnaฤe",
+ "wosomnaฤe",
+ "dลบewjatnaฤe",
+ "dwaceฤi",
+ "tลiceฤi",
+ "ลกtyrceฤi",
+ "pjeฤdลบesat",
+ "ลกฤsฤdลบesat",
+ "sydomdลบesat",
+ "wosomdลบesat",
+ "dลบewjeฤdลบesat",
+ "sto",
+ "tysac",
+ "milion",
+ "miliarda",
+ "bilion",
+ "biliarda",
+ "trilion",
+ "triliarda",
+]
+
+_ordinal_words = [
+ "prฤni",
+ "prฤnja",
+ "prฤnje",
+ "druhi",
+ "druha",
+ "druhe",
+ "tลeฤi",
+ "tลeฤa",
+ "tลeฤe",
+ "ลกtwรณrty",
+ "ลกtwรณrta",
+ "ลกtwรณrte",
+ "pjaty",
+ "pjata",
+ "pjate",
+ "ลกฤsty",
+ "ลกฤsta",
+ "ลกฤste",
+ "sydmy",
+ "sydma",
+ "sydme",
+ "wosmy",
+ "wosma",
+ "wosme",
+ "dลบewjaty",
+ "dลบewjata",
+ "dลบewjate",
+ "dลบesaty",
+ "dลบesata",
+ "dลบesate",
+ "jฤdnaty",
+ "jฤdnata",
+ "jฤdnate",
+ "dwanaty",
+ "dwanata",
+ "dwanate",
+]
+
+
+def like_num(text):
+ if text.startswith(("+", "-", "ยฑ", "~")):
+ text = text[1:]
+ text = text.replace(",", "").replace(".", "")
+ if text.isdigit():
+ return True
+ if text.count("/") == 1:
+ num, denom = text.split("/")
+ if num.isdigit() and denom.isdigit():
+ return True
+ text_lower = text.lower()
+ if text_lower in _num_words:
+ return True
+ # Check ordinal number
+ if text_lower in _ordinal_words:
+ return True
+ return False
+
+
+LEX_ATTRS = {LIKE_NUM: like_num}
diff --git a/spacy/lang/hsb/stop_words.py b/spacy/lang/hsb/stop_words.py
new file mode 100644
index 000000000..e6fedaf4c
--- /dev/null
+++ b/spacy/lang/hsb/stop_words.py
@@ -0,0 +1,19 @@
+STOP_WORDS = set(
+ """
+a abo ale ani
+
+dokelลพ
+
+hdyลพ
+
+jeli jelizo
+
+kaลพ
+
+pak potom
+
+teลพ tohodla
+
+zo zoby
+""".split()
+)
diff --git a/spacy/lang/hsb/tokenizer_exceptions.py b/spacy/lang/hsb/tokenizer_exceptions.py
new file mode 100644
index 000000000..4b9a4f98a
--- /dev/null
+++ b/spacy/lang/hsb/tokenizer_exceptions.py
@@ -0,0 +1,18 @@
+from ..tokenizer_exceptions import BASE_EXCEPTIONS
+from ...symbols import ORTH, NORM
+from ...util import update_exc
+
+_exc = dict()
+for exc_data in [
+ {ORTH: "mil.", NORM: "milion"},
+ {ORTH: "wob.", NORM: "wobydler"},
+]:
+ _exc[exc_data[ORTH]] = [exc_data]
+
+for orth in [
+ "resp.",
+]:
+ _exc[orth] = [{ORTH: orth}]
+
+
+TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
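
A short sketch of what the Upper Sorbian exceptions above do once the language is registered: the listed abbreviations stay single tokens and carry the expanded NORM. The phrase is a made-up example, assuming this patch is installed:

    from spacy.lang.hsb import UpperSorbian

    nlp = UpperSorbian()
    doc = nlp("nimale 1 mil. wobydlerjow")
    print([(t.text, t.norm_) for t in doc])
    # [('nimale', 'nimale'), ('1', '1'), ('mil.', 'milion'), ('wobydlerjow', 'wobydlerjow')]
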
diff --git a/spacy/lang/ko/__init__.py b/spacy/lang/ko/__init__.py
index 05fc67e79..0e02e4a2d 100644
--- a/spacy/lang/ko/__init__.py
+++ b/spacy/lang/ko/__init__.py
@@ -1,12 +1,13 @@
from typing import Iterator, Any, Dict
+from .punctuation import TOKENIZER_INFIXES
from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS
from ...language import Language, BaseDefaults
from ...tokens import Doc
from ...scorer import Scorer
-from ...symbols import POS
+from ...symbols import POS, X
from ...training import validate_examples
from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab
@@ -31,15 +32,24 @@ def create_tokenizer():
class KoreanTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab):
self.vocab = vocab
- MeCab = try_mecab_import() # type: ignore[func-returns-value]
- self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
+ self._mecab = try_mecab_import() # type: ignore[func-returns-value]
+ self._mecab_tokenizer = None
+
+ @property
+ def mecab_tokenizer(self):
+ # This is a property so that initializing a pipeline with blank:ko is
+ # possible without actually requiring mecab-ko, e.g. to run
+ # `spacy init vectors ko` for a pipeline that will have a different
+ # tokenizer in the end. The languages need to match for the vectors
+ # to be imported and there's no way to pass a custom config to
+ # `init vectors`.
+ if self._mecab_tokenizer is None:
+ self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
+ return self._mecab_tokenizer
def __reduce__(self):
return KoreanTokenizer, (self.vocab,)
- def __del__(self):
- self.mecab_tokenizer.__del__()
-
def __call__(self, text: str) -> Doc:
dtokens = list(self.detailed_tokens(text))
surfaces = [dt["surface"] for dt in dtokens]
@@ -47,7 +57,10 @@ class KoreanTokenizer(DummyTokenizer):
for token, dtoken in zip(doc, dtokens):
first_tag, sep, eomi_tags = dtoken["tag"].partition("+")
token.tag_ = first_tag # stem(์ด๊ฐ) or pre-final(์ ์ด๋ง ์ด๋ฏธ)
- token.pos = TAG_MAP[token.tag_][POS]
+ if token.tag_ in TAG_MAP:
+ token.pos = TAG_MAP[token.tag_][POS]
+ else:
+ token.pos = X
token.lemma_ = dtoken["lemma"]
doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
return doc
@@ -76,6 +89,7 @@ class KoreanDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
+ infixes = TOKENIZER_INFIXES
class Korean(Language):
@@ -90,7 +104,8 @@ def try_mecab_import() -> None:
return MeCab
except ImportError:
raise ImportError(
- "Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
+ 'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
+ "[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)"
) from None
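
The lazy `mecab_tokenizer` property above moves the MeCab construction out of `__init__`, so a blank Korean pipeline (e.g. for `spacy init vectors ko`) can be created before a working mecab-ko setup is available. A rough sketch of the effect, assuming this patch and an importable natto-py:

    import spacy

    nlp = spacy.blank("ko")                        # no MeCab("-F%f[0],%f[7]") call yet
    assert nlp.tokenizer._mecab_tokenizer is None  # still unset
    # nlp(text)  # the first call on actual text builds MeCab via the property
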
diff --git a/spacy/lang/ko/punctuation.py b/spacy/lang/ko/punctuation.py
new file mode 100644
index 000000000..7f7b40c5b
--- /dev/null
+++ b/spacy/lang/ko/punctuation.py
@@ -0,0 +1,12 @@
+from ..char_classes import LIST_QUOTES
+from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES
+
+
+_infixes = (
+ ["ยท", "ใ", "\(", "\)"]
+ + [r"(?<=[0-9])~(?=[0-9-])"]
+ + LIST_QUOTES
+ + BASE_TOKENIZER_INFIXES
+)
+
+TOKENIZER_INFIXES = _infixes
diff --git a/spacy/lang/ru/lex_attrs.py b/spacy/lang/ru/lex_attrs.py
index 7979c7ea6..2afe47623 100644
--- a/spacy/lang/ru/lex_attrs.py
+++ b/spacy/lang/ru/lex_attrs.py
@@ -1,56 +1,219 @@
from ...attrs import LIKE_NUM
-_num_words = [
- "ะฝะพะปั",
- "ะพะดะธะฝ",
- "ะดะฒะฐ",
- "ััะธ",
- "ัะตัััะต",
- "ะฟััั",
- "ัะตััั",
- "ัะตะผั",
- "ะฒะพัะตะผั",
- "ะดะตะฒััั",
- "ะดะตัััั",
- "ะพะดะธะฝะฝะฐะดัะฐัั",
- "ะดะฒะตะฝะฐะดัะฐัั",
- "ััะธะฝะฐะดัะฐัั",
- "ัะตัััะฝะฐะดัะฐัั",
- "ะฟััะฝะฐะดัะฐัั",
- "ัะตััะฝะฐะดัะฐัั",
- "ัะตะผะฝะฐะดัะฐัั",
- "ะฒะพัะตะผะฝะฐะดัะฐัั",
- "ะดะตะฒััะฝะฐะดัะฐัั",
- "ะดะฒะฐะดัะฐัั",
- "ััะธะดัะฐัั",
- "ัะพัะพะบ",
- "ะฟัััะดะตััั",
- "ัะตัััะดะตััั",
- "ัะตะผัะดะตััั",
- "ะฒะพัะตะผัะดะตััั",
- "ะดะตะฒัะฝะพััะพ",
- "ััะพ",
- "ะดะฒะตััะธ",
- "ััะธััะฐ",
- "ัะตัััะตััะฐ",
- "ะฟััััะพั",
- "ัะตััััะพั",
- "ัะตะผััะพั",
- "ะฒะพัะตะผััะพั",
- "ะดะตะฒััััะพั",
- "ัััััะฐ",
- "ะผะธะปะปะธะพะฝ",
- "ะผะธะปะปะธะฐัะด",
- "ััะธะปะปะธะพะฝ",
- "ะบะฒะฐะดัะธะปะปะธะพะฝ",
- "ะบะฒะธะฝัะธะปะปะธะพะฝ",
-]
+_num_words = list(
+ set(
+ """
+ะฝะพะปั ะฝะพะปั ะฝะพะปั ะฝะพะปัะผ ะฝะพะปะต ะฝัะปะตะฒะพะน ะฝัะปะตะฒะพะณะพ ะฝัะปะตะฒะพะผั ะฝัะปะตะฒัะผ ะฝัะปะตะฒะพะผ ะฝัะปะตะฒะฐั ะฝัะปะตะฒัั ะฝัะปะตะฒะพะต ะฝัะปะตะฒัะต ะฝัะปะตะฒัั ะฝัะปะตะฒัะผะธ
+
+ัะตัะฒะตััั ัะตัะฒะตััะธ ัะตัะฒะตัััั ัะตัะฒะตััะตะน ัะตัะฒะตัััะผ ัะตัะฒะตัััะผะธ ัะตัะฒะตัััั
+
+ััะตัั ััะตัะธ ััะตััั ััะตัะตะน ััะตััะผ ััะตััะผะธ ััะตััั
+
+ะฟะพะปะพะฒะธะฝะฐ ะฟะพะปะพะฒะธะฝั ะฟะพะปะพะฒะธะฝะต ะฟะพะปะพะฒะธะฝั ะฟะพะปะพะฒะธะฝะพะน ะฟะพะปะพะฒะธะฝ ะฟะพะปะพะฒะธะฝะฐะผ ะฟะพะปะพะฒะธะฝะฐะผะธ ะฟะพะปะพะฒะธะฝะฐั ะฟะพะปะพะฒะธะฝะพั
+
+ะพะดะธะฝ ะพะดะฝะพะณะพ ะพะดะฝะพะผั ะพะดะฝะธะผ ะพะดะฝะพะผ
+ะฟะตัะฒะพะน ะฟะตัะฒะพะณะพ ะฟะตัะฒะพะผั ะฟะตัะฒะพะผ ะฟะตัะฒัะน ะฟะตัะฒัะผ ะฟะตัะฒัั
+ะฒะพ-ะฟะตัะฒัั
+ะตะดะธะฝะธัะฐ ะตะดะธะฝะธัั ะตะดะธะฝะธัะต ะตะดะธะฝะธัั ะตะดะธะฝะธัะตะน ะตะดะธะฝะธั ะตะดะธะฝะธัะฐะผ ะตะดะธะฝะธัะฐะผะธ ะตะดะธะฝะธัะฐั ะตะดะธะฝะธัะตั
+
+ะดะฒะฐ ะดะฒัะผั ะดะฒัะผ ะดะฒัั ะดะฒะพะธั ะดะฒะพะต ะดะฒะต
+ะฒัะพัะพะณะพ ะฒัะพัะพะผั ะฒัะพัะพะน ะฒัะพัะพะผ ะฒัะพััะผ ะฒัะพััั
+ะดะฒะพะนะบะฐ ะดะฒะพะนะบะธ ะดะฒะพะนะบะต ะดะฒะพะนะบั ะดะฒะพะนะบะพะน ะดะฒะพะตะบ ะดะฒะพะนะบะฐะผ ะดะฒะพะนะบะฐะผะธ ะดะฒะพะนะบะฐั ะดะฒะพะนะบะพั
+ะฒะพ-ะฒัะพััั
+ะพะฑะฐ ะพะฑะต ะพะฑะตะธะผ ะพะฑะตะธะผะธ ะพะฑะตะธั ะพะฑะพะธะผ ะพะฑะพะธะผะธ ะพะฑะพะธั
+
+ะฟะพะปัะพัะฐ ะฟะพะปัะพัั ะฟะพะปััะพัะฐ
+
+ััะธ ััะตััะตะณะพ ััะตััะตะผั ััะตััะตะผ ััะตััะธะผ ััะตัะธะน ััะตะผั ััะตะผ ััะตั ััะพะต ััะพะธั ัััั
+ััะพะนะบะฐ ััะพะนะบะธ ััะพะนะบะต ััะพะนะบั ััะพะนะบะพั ััะพะตะบ ััะพะนะบะฐะผ ััะพะนะบะฐะผะธ ััะพะนะบะฐั ััะพะนะบะพะน
+ััะพะตัะบะฐ ััะพะตัะบะธ ััะพะตัะบะต ััะพะตัะบั ััะพะตัะบะพะน ััะพะตัะตะบ ััะพะตัะบะฐะผ ััะพะตัะบะฐะผะธ ััะพะตัะบะฐั ััะพะตัะบะพะน
+ััะตัะบะฐ ััะตัะบะธ ััะตัะบะต ััะตัะบั ััะตัะบะพะน ััะตัะตะบ ััะตัะบะฐะผ ััะตัะบะฐะผะธ ััะตัะบะฐั ััะตัะบะพั
+ััััะบะฐ ััััะบะธ ััััะบะต ััััะบั ััััะบะพะน ััััะตะบ ััััะบะฐะผ ััััะบะฐะผะธ ััััะบะฐั ััััะบะพั
+ััะพัะบ ััะพัะบะฐ ััะพัะบั ััะพัะบะพะผ ััะพัะบะต ััะพัะบะธ ััะพัะบะพะฒ ััะพัะบะฐะผ ััะพัะบะฐะผะธ ััะพัะบะฐั
+ััะตั ะฐ ััะตั ั ััะตั ะพะน
+ัััั ะฐ ัััั ั ัััั ะพะน
+ะฒััะพะตะผ ะฒััะพัะผ
+
+ัะตัััะต ัะตัะฒะตััะพะณะพ ัะตัะฒะตััะพะผั ัะตัะฒะตััะพะผ ัะตัะฒะตัััะน ัะตัะฒะตัััะผ ัะตัะฒะตัะบะฐ ัะตััััะผั ัะตัััะตะผ ัะตัััะตั ัะตัะฒะตัะพ ัะตััััั ัะตัะฒะตััะผ
+ัะตัะฒะตััั
+ะฒัะตัะฒะตัะพะผ
+
+ะฟััั ะฟััะพะณะพ ะฟััะพะผั ะฟััะพะผ ะฟัััะน ะฟัััะผ ะฟัััั ะฟััะธ ะฟััะตัะพ ะฟััะตััั ะฟััะตััะผะธ
+ะฒะฟััะตัะพะผ
+ะฟััะตัะพัะบะฐ ะฟััะตัะพัะบะธ ะฟััะตัะพัะบะต ะฟััะตัะพัะบะฐะผะธ ะฟััะตัะพัะบะพะน ะฟััะตัะพัะบั ะฟััะตัะพัะบะพะน ะฟััะตัะพัะบะฐะผะธ
+ะฟััััะพัะบะฐ ะฟััััะพัะบะธ ะฟััััะพัะบะต ะฟััััะพัะบะฐะผะธ ะฟััััะพัะบะพะน ะฟััััะพัะบั ะฟััััะพัะบะพะน ะฟััััะพัะบะฐะผะธ
+ะฟััะตัะบะฐ ะฟััะตัะบะธ ะฟััะตัะบะต ะฟััะตัะบะฐะผะธ ะฟััะตัะบะพะน ะฟััะตัะบั ะฟััะตัะบะฐะผะธ
+ะฟััััะบะฐ ะฟััััะบะธ ะฟััััะบะต ะฟััััะบะฐะผะธ ะฟััััะบะพะน ะฟััััะบั ะฟััััะบะฐะผะธ
+ะฟััััะฐ ะฟััััั ะฟััััะต ะฟััััะฐะผะธ ะฟััััะพะน ะฟััััั ะฟััััะฐะผะธ
+ะฟััะตัะฐ ะฟััะตัั ะฟััะตัะต ะฟััะตัะฐะผะธ ะฟััะตัะพะน ะฟััะตัั ะฟััะตัะฐะผะธ
+ะฟััะฐะบ ะฟััะฐะบะธ ะฟััะฐะบะต ะฟััะฐะบะฐะผะธ ะฟััะฐะบะพะผ ะฟััะฐะบั ะฟััะฐะบะฐะผะธ
+
+ัะตััั ัะตััะตัะบะฐ ัะตััะพะณะพ ัะตััะพะผั ัะตััะพะน ัะตััะพะผ ัะตัััะผ ัะตัััั ัะตััะธ ัะตััะตัะพ ัะตััะตััั
+ะฒัะตััะตัะพะผ
+
+ัะตะผั ัะตะผะตัะบะฐ ัะตะดัะผะพะณะพ ัะตะดัะผะพะผั ัะตะดัะผะพะน ัะตะดัะผะพะผ ัะตะดัะผัะผ ัะตะผัั ัะตะผะธ ัะตะผะตัะพ ัะตะดัะผัั
+ะฒัะตะผะตัะพะผ
+
+ะฒะพัะตะผั ะฒะพััะผะตัะบะฐ ะฒะพััะผะพะณะพ ะฒะพััะผะพะผั ะฒะพัะตะผัั ะฒะพััะผะพะน ะฒะพััะผะพะผ ะฒะพััะผัะผ ะฒะพัะตะผะธ ะฒะพััะผะตัะพะผ ะฒะพััะผะธ ะฒะพััะผัั
+ะฒะพััะผะตััั
+ะฒะฒะพััะผะตัะพะผ
+
+ะดะตะฒััั ะดะตะฒััะพะณะพ ะดะตะฒััะพะผั ะดะตะฒััะบะฐ ะดะตะฒััะพะผ ะดะตะฒัััะน ะดะตะฒัััะผ ะดะตะฒัััั ะดะตะฒััะธ ะดะตะฒััะตัะพะผ ะฒะดะตะฒััะตัะพะผ ะดะตะฒััะตััั
+ะฒะดะตะฒััะตัะพะผ
+
+ะดะตัััั ะดะตัััะพะณะพ ะดะตัััะพะผั ะดะตัััะบะฐ ะดะตัััะพะผ ะดะตััััะน ะดะตััััะผ ะดะตััััั ะดะตัััะธ ะดะตัััะตัะพะผ ะดะตััััั
+ะฒะดะตัััะตัะพะผ
+
+ะพะดะธะฝะฝะฐะดัะฐัั ะพะดะธะฝะฝะฐะดัะฐัะพะณะพ ะพะดะธะฝะฝะฐะดัะฐัะพะผั ะพะดะธะฝะฝะฐะดัะฐัะพะผ ะพะดะธะฝะฝะฐะดัะฐััะน ะพะดะธะฝะฝะฐะดัะฐััะผ ะพะดะธะฝะฝะฐะดัะฐััั ะพะดะธะฝะฝะฐะดัะฐัะธ
+ะพะดะธะฝะฝะฐะดัะฐััั
+
+ะดะฒะตะฝะฐะดัะฐัั ะดะฒะตะฝะฐะดัะฐัะพะณะพ ะดะฒะตะฝะฐะดัะฐัะพะผั ะดะฒะตะฝะฐะดัะฐัะพะผ ะดะฒะตะฝะฐะดัะฐััะน ะดะฒะตะฝะฐะดัะฐััะผ ะดะฒะตะฝะฐะดัะฐััั ะดะฒะตะฝะฐะดัะฐัะธ
+ะดะฒะตะฝะฐะดัะฐััั
+
+ััะธะฝะฐะดัะฐัั ััะธะฝะฐะดัะฐัะพะณะพ ััะธะฝะฐะดัะฐัะพะผั ััะธะฝะฐะดัะฐัะพะผ ััะธะฝะฐะดัะฐััะน ััะธะฝะฐะดัะฐััะผ ััะธะฝะฐะดัะฐััั ััะธะฝะฐะดัะฐัะธ
+ััะธะฝะฐะดัะฐััั
+
+ัะตัััะฝะฐะดัะฐัั ัะตัััะฝะฐะดัะฐัะพะณะพ ัะตัััะฝะฐะดัะฐัะพะผั ัะตัััะฝะฐะดัะฐัะพะผ ัะตัััะฝะฐะดัะฐััะน ัะตัััะฝะฐะดัะฐััะผ ัะตัััะฝะฐะดัะฐััั ัะตัััะฝะฐะดัะฐัะธ
+ัะตัััะฝะฐะดัะฐััั
+
+ะฟััะฝะฐะดัะฐัั ะฟััะฝะฐะดัะฐัะพะณะพ ะฟััะฝะฐะดัะฐัะพะผั ะฟััะฝะฐะดัะฐัะพะผ ะฟััะฝะฐะดัะฐััะน ะฟััะฝะฐะดัะฐััะผ ะฟััะฝะฐะดัะฐััั ะฟััะฝะฐะดัะฐัะธ
+ะฟััะฝะฐะดัะฐััั
+ะฟััะฝะฐัะธะบ ะฟััะฝะฐัะธะบั ะฟััะฝะฐัะธะบะพะผ ะฟััะฝะฐัะธะบะธ
+
+ัะตััะฝะฐะดัะฐัั ัะตััะฝะฐะดัะฐัะพะณะพ ัะตััะฝะฐะดัะฐัะพะผั ัะตััะฝะฐะดัะฐัะพะผ ัะตััะฝะฐะดัะฐััะน ัะตััะฝะฐะดัะฐััะผ ัะตััะฝะฐะดัะฐััั ัะตััะฝะฐะดัะฐัะธ
+ัะตััะฝะฐะดัะฐััั
+
+ัะตะผะฝะฐะดัะฐัั ัะตะผะฝะฐะดัะฐัะพะณะพ ัะตะผะฝะฐะดัะฐัะพะผั ัะตะผะฝะฐะดัะฐัะพะผ ัะตะผะฝะฐะดัะฐััะน ัะตะผะฝะฐะดัะฐััะผ ัะตะผะฝะฐะดัะฐััั ัะตะผะฝะฐะดัะฐัะธ ัะตะผะฝะฐะดัะฐััั
+
+ะฒะพัะตะผะฝะฐะดัะฐัั ะฒะพัะตะผะฝะฐะดัะฐัะพะณะพ ะฒะพัะตะผะฝะฐะดัะฐัะพะผั ะฒะพัะตะผะฝะฐะดัะฐัะพะผ ะฒะพัะตะผะฝะฐะดัะฐััะน ะฒะพัะตะผะฝะฐะดัะฐััะผ ะฒะพัะตะผะฝะฐะดัะฐััั ะฒะพัะตะผะฝะฐะดัะฐัะธ
+ะฒะพัะตะผะฝะฐะดัะฐััั
+
+ะดะตะฒััะฝะฐะดัะฐัั ะดะตะฒััะฝะฐะดัะฐัะพะณะพ ะดะตะฒััะฝะฐะดัะฐัะพะผั ะดะตะฒััะฝะฐะดัะฐัะพะผ ะดะตะฒััะฝะฐะดัะฐััะน ะดะตะฒััะฝะฐะดัะฐััะผ ะดะตะฒััะฝะฐะดัะฐััั ะดะตะฒััะฝะฐะดัะฐัะธ
+ะดะตะฒััะฝะฐะดัะฐััั
+
+ะดะฒะฐะดัะฐัั ะดะฒะฐะดัะฐัะพะณะพ ะดะฒะฐะดัะฐัะพะผั ะดะฒะฐะดัะฐัะพะผ ะดะฒะฐะดัะฐััะน ะดะฒะฐะดัะฐััะผ ะดะฒะฐะดัะฐััั ะดะฒะฐะดัะฐัะธ ะดะฒะฐะดัะฐััั
+
+ัะตัะฒะตััะฐะบ ัะตัะฒะตััะฐะบะฐ ัะตัะฒะตััะฐะบะต ัะตัะฒะตััะฐะบั ัะตัะฒะตััะฐะบะธ ัะตัะฒะตััะฐะบะพะผ ัะตัะฒะตััะฐะบะฐะผะธ
+
+ััะธะดัะฐัั ััะธะดัะฐัะพะณะพ ััะธะดัะฐัะพะผั ััะธะดัะฐัะพะผ ััะธะดัะฐััะน ััะธะดัะฐััะผ ััะธะดัะฐััั ััะธะดัะฐัะธ ััะธะดัะฐััั
+ััะธะดัะฐะดะบะฐ ััะธะดัะฐะดะบั ััะธะดัะฐะดะบะต ััะธะดัะฐะดะบะธ ััะธะดัะฐะดะบะพะน ััะธะดัะฐะดะบะพั ััะธะดัะฐะดะบะฐะผะธ
+
+ััะธะดะตะฒััั ััะธะดะตะฒััะธ ััะธะดะตะฒัััั
+
+ัะพัะพะบ ัะพัะพะบะพะฒะพะณะพ ัะพัะพะบะพะฒะพะผั ัะพัะพะบะพะฒะพะผ ัะพัะพะบะพะฒัะผ ัะพัะพะบะพะฒะพะน ัะพัะพะบะพะฒัั
+ัะพัะพะบะตั ัะพัะพะบะตัะฐ ัะพัะพะบะตัั ัะพัะพะบะตัะต ัะพัะพะบะตัั ัะพัะพะบะตัะพะผ ัะพัะพะบะตัะฐะผะธ ัะพัะพะบะตัะฐะผ
+
+ะฟัััะดะตััั ะฟัััะดะตัััะพะณะพ ะฟัััะดะตัััะพะผั ะฟััััะดะตััััั ะฟัััะดะตัััะพะผ ะฟัััะดะตััััะน ะฟัััะดะตััััะผ ะฟััะธะดะตัััะธ ะฟัััะดะตััััั
+ะฟะพะปัะธะฝะฝะธะบ ะฟะพะปัะธะฝะฝะธะบะฐ ะฟะพะปัะธะฝะฝะธะบะต ะฟะพะปัะธะฝะฝะธะบั ะฟะพะปัะธะฝะฝะธะบะธ ะฟะพะปัะธะฝะฝะธะบะพะผ ะฟะพะปัะธะฝะฝะธะบะฐะผะธ ะฟะพะปัะธะฝะฝะธะบะฐะผ ะฟะพะปัะธะฝะฝะธะบะฐั
+ะฟััะธะดะตัััะบะฐ ะฟััะธะดะตัััะบะต ะฟััะธะดะตัััะบั ะฟััะธะดะตัััะบะธ ะฟััะธะดะตัััะบะพะน ะฟััะธะดะตัััะบะฐะผะธ ะฟััะธะดะตัััะบะฐะผ ะฟััะธะดะตัััะบะฐั
+ะฟะพะปัะพั ะฟะพะปัะพัะฐ ะฟะพะปัะพัะต ะฟะพะปัะพัั ะฟะพะปัะพัั ะฟะพะปัะพัะพะผ ะฟะพะปัะพัะฐะผะธ ะฟะพะปัะพัะฐะผ ะฟะพะปัะพัะฐั
+
+ัะตัััะดะตััั ัะตัััะดะตัััะพะณะพ ัะตัััะดะตัััะพะผั ัะตััััะดะตััััั ัะตัััะดะตัััะพะผ ัะตัััะดะตััััะน ัะตัััะดะตััััะผ ัะตััะธะดะตััััะต ัะตััะธะดะตัััะธ
+ัะตัััะดะตััััั
+
+ัะตะผัะดะตััั ัะตะผัะดะตัััะพะณะพ ัะตะผัะดะตัััะพะผั ัะตะผััะดะตััััั ัะตะผัะดะตัััะพะผ ัะตะผัะดะตััััะน ัะตะผัะดะตััััะผ ัะตะผะธะดะตัััะธ ัะตะผัะดะตััััั
+
+ะฒะพัะตะผัะดะตััั ะฒะพัะตะผัะดะตัััะพะณะพ ะฒะพัะตะผัะดะตัััะพะผั ะฒะพัะตะผััะดะตััััั ะฒะพัะตะผัะดะตัััะพะผ ะฒะพัะตะผัะดะตััััะน ะฒะพัะตะผัะดะตััััะผ ะฒะพัะตะผะธะดะตัััะธ
+ะฒะพััะผะธะดะตัััะธ ะฒะพััะผะธะดะตััััั
+
+ะดะตะฒัะฝะพััะพ ะดะตะฒัะฝะพััะพะณะพ ะดะตะฒัะฝะพััะพะผั ะดะตะฒัะฝะพััะพะผ ะดะตะฒัะฝะพัััะน ะดะตะฒัะฝะพัััะผ ะดะตะฒัะฝะพััะฐ ะดะตะฒัะฝะพัััั
+
+ััะพ ัะพัะพะณะพ ัะพัะพะผั ัะพัะพะผ ัะพัะตะฝ ัะพััะน ัะพััะผ ััะฐ
+ััะพะปัะฝะธะบ ััะพะปัะฝะธะบะฐ ััะพะปัะฝะธะบั ััะพะปัะฝะธะบะต ััะพะปัะฝะธะบะธ ััะพะปัะฝะธะบะพะผ ััะพะปัะฝะธะบะฐะผะธ
+ัะพัะบะฐ ัะพัะบะธ ัะพัะบะต ัะพัะบะพะน ัะพัะบะฐะผะธ ัะพัะบะฐะผ ัะพัะบะฐั
+ัะพัะฝั ัะพัะฝะธ ัะพัะฝะต ัะพัะฝะตะน ัะพัะฝัะผะธ ัะพัะฝัะผ ัะพัะฝัั
+
+ะดะฒะตััะธ ะดะฒัะผัััะฐะผะธ ะดะฒัั ัะพัะพะณะพ ะดะฒัั ัะพัะพะผั ะดะฒัั ัะพัะพะผ ะดะฒัั ัะพััะน ะดะฒัั ัะพััะผ ะดะฒัะผััะฐะผ ะดะฒัั ััะฐั ะดะฒัั ัะพั
+
+ััะธััะฐ ััะตะผัััะฐะผะธ ััะตั ัะพัะพะณะพ ััะตั ัะพัะพะผั ััะตั ัะพัะพะผ ััะตั ัะพััะน ััะตั ัะพััะผ ััะตะผััะฐะผ ััะตั ััะฐั ััะตั ัะพั
+
+ัะตัััะตััะฐ ัะตัััะตั ัะพัะพะณะพ ัะตัััะตั ัะพัะพะผั ัะตััััะผัััะฐะผะธ ัะตัััะตั ัะพัะพะผ ัะตัััะตั ัะพััะน ัะตัััะตั ัะพััะผ ัะตัััะตะผััะฐะผ ัะตัััะตั ััะฐั
+ัะตัััะตั ัะพั
+
+ะฟััััะพั ะฟััะธัะพัะพะณะพ ะฟััะธัะพัะพะผั ะฟััััััะฐะผะธ ะฟััะธัะพัะพะผ ะฟััะธัะพััะน ะฟััะธัะพััะผ ะฟััะธััะฐะผ ะฟััะธััะฐั ะฟััะธัะพั
+ะฟััะธัะพัะบะฐ ะฟััะธัะพัะบะธ ะฟััะธัะพัะบะต ะฟััะธัะพัะบะพะน ะฟััะธัะพัะบะฐะผะธ ะฟััะธัะพัะบะฐะผ ะฟััะธัะพัะบะพั ะฟััะธัะพัะบะฐั
+ะฟััะธั ะฐัะบะฐ ะฟััะธั ะฐัะบะธ ะฟััะธั ะฐัะบะต ะฟััะธั ะฐัะบะพะน ะฟััะธั ะฐัะบะฐะผะธ ะฟััะธั ะฐัะบะฐะผ ะฟััะธั ะฐัะบะพั ะฟััะธั ะฐัะบะฐั
+ะฟััะธัะฐะฝ ะฟััะธัะฐะฝั ะฟััะธัะฐะฝะต ะฟััะธัะฐะฝะพะผ ะฟััะธัะฐะฝะฐะผะธ ะฟััะธัะฐะฝะฐั
+
+ัะตััััะพั ัะตััะธัะพัะพะณะพ ัะตััะธัะพัะพะผั ัะตััััััะฐะผะธ ัะตััะธัะพัะพะผ ัะตััะธัะพััะน ัะตััะธัะพััะผ ัะตััะธััะฐะผ ัะตััะธััะฐั ัะตััะธัะพั
+
+ัะตะผััะพั ัะตะผะธัะพัะพะณะพ ัะตะผะธัะพัะพะผั ัะตะผััััะฐะผะธ ัะตะผะธัะพัะพะผ ัะตะผะธัะพััะน ัะตะผะธัะพััะผ ัะตะผะธััะฐะผ ัะตะผะธััะฐั ัะตะผะธัะพั
+
+ะฒะพัะตะผััะพั ะฒะพัะตะผะธัะพัะพะณะพ ะฒะพัะตะผะธัะพัะพะผั ะฒะพัะตะผะธัะพัะพะผ ะฒะพัะตะผะธัะพััะน ะฒะพัะตะผะธัะพััะผ ะฒะพััะผะธััะฐะผะธ ะฒะพััะผะธััะฐะผ ะฒะพััะผะธััะฐั ะฒะพััะผะธัะพั
+
+ะดะตะฒััััะพั ะดะตะฒััะธัะพัะพะณะพ ะดะตะฒััะธัะพัะพะผั ะดะตะฒััััััะฐะผะธ ะดะตะฒััะธัะพัะพะผ ะดะตะฒััะธัะพััะน ะดะตะฒััะธัะพััะผ ะดะตะฒััะธััะฐะผ ะดะตะฒััะธััะฐั ะดะตะฒััะธัะพั
+
+ัััััะฐ ัััััะฝะพะณะพ ัััััะฝะพะผั ัััััะฝะพะผ ัััััะฝัะน ัััััะฝัะผ ัััััะฐะผ ัััััะฐั ัััััะตะน ััััั ัััััะธ ััั
+ะบะพัะฐัั ะบะพัะฐัั ะบะพัะฐัั ะบะพัะฐัะตะผ ะบะพัะฐััะผะธ ะบะพัะฐััั ะบะพัะฐััะผ ะบะพัะฐัะตะน
+
+ะดะตัััะธัััััะฝัะน ะดะตัััะธัััััะฝะพะณะพ ะดะตัััะธัััััะฝะพะผั ะดะตัััะธัััััะฝัะผ ะดะตัััะธัััััะฝะพะผ ะดะตัััะธัััััะฝะฐั ะดะตัััะธัััััะฝะพะน
+ะดะตัััะธัััััะฝัั ะดะตัััะธัััััะฝะพั ะดะตัััะธัััััะฝะพะต ะดะตัััะธัััััะฝัะต ะดะตัััะธัััััะฝัั ะดะตัััะธัััััะฝัะผะธ
+
+ะดะฒะฐะดัะฐัะธัััััะฝัะน ะดะฒะฐะดัะฐัะธัััััะฝะพะณะพ ะดะฒะฐะดัะฐัะธัััััะฝะพะผั ะดะฒะฐะดัะฐัะธัััััะฝัะผ ะดะฒะฐะดัะฐัะธัััััะฝะพะผ ะดะฒะฐะดัะฐัะธัััััะฝะฐั
+ะดะฒะฐะดัะฐัะธัััััะฝะพะน ะดะฒะฐะดัะฐัะธัััััะฝัั ะดะฒะฐะดัะฐัะธัััััะฝะพั ะดะฒะฐะดัะฐัะธัััััะฝะพะต ะดะฒะฐะดัะฐัะธัััััะฝัะต ะดะฒะฐะดัะฐัะธัััััะฝัั
+ะดะฒะฐะดัะฐัะธัััััะฝัะผะธ
+
+ััะธะดัะฐัะธัััััะฝัะน ััะธะดัะฐัะธัััััะฝะพะณะพ ััะธะดัะฐัะธัััััะฝะพะผั ััะธะดัะฐัะธัััััะฝัะผ ััะธะดัะฐัะธัััััะฝะพะผ ััะธะดัะฐัะธัััััะฝะฐั
+ััะธะดัะฐัะธัััััะฝะพะน ััะธะดัะฐัะธัััััะฝัั ััะธะดัะฐัะธัััััะฝะพั ััะธะดัะฐัะธัััััะฝะพะต ััะธะดัะฐัะธัััััะฝัะต ััะธะดัะฐัะธัััััะฝัั
+ััะธะดัะฐัะธัััััะฝัะผะธ
+
+ัะพัะพะบะฐัััััะฝัะน ัะพัะพะบะฐัััััะฝะพะณะพ ัะพัะพะบะฐัััััะฝะพะผั ัะพัะพะบะฐัััััะฝัะผ ัะพัะพะบะฐัััััะฝะพะผ ัะพัะพะบะฐัััััะฝะฐั
+ัะพัะพะบะฐัััััะฝะพะน ัะพัะพะบะฐัััััะฝัั ัะพัะพะบะฐัััััะฝะพั ัะพัะพะบะฐัััััะฝะพะต ัะพัะพะบะฐัััััะฝัะต ัะพัะพะบะฐัััััะฝัั
+ัะพัะพะบะฐัััััะฝัะผะธ
+
+ะฟััะธะดะตัััะธัััััะฝัะน ะฟััะธะดะตัััะธัััััะฝะพะณะพ ะฟััะธะดะตัััะธัััััะฝะพะผั ะฟััะธะดะตัััะธัััััะฝัะผ ะฟััะธะดะตัััะธัััััะฝะพะผ ะฟััะธะดะตัััะธัััััะฝะฐั
+ะฟััะธะดะตัััะธัััััะฝะพะน ะฟััะธะดะตัััะธัััััะฝัั ะฟััะธะดะตัััะธัััััะฝะพั ะฟััะธะดะตัััะธัััััะฝะพะต ะฟััะธะดะตัััะธัััััะฝัะต ะฟััะธะดะตัััะธัััััะฝัั
+ะฟััะธะดะตัััะธัััััะฝัะผะธ
+
+ัะตััะธะดะตัััะธัััััะฝัะน ัะตััะธะดะตัััะธัััััะฝะพะณะพ ัะตััะธะดะตัััะธัััััะฝะพะผั ัะตััะธะดะตัััะธัััััะฝัะผ ัะตััะธะดะตัััะธัััััะฝะพะผ ัะตััะธะดะตัััะธัััััะฝะฐั
+ัะตััะธะดะตัััะธัััััะฝะพะน ัะตััะธะดะตัััะธัััััะฝัั ัะตััะธะดะตัััะธัััััะฝะพั ัะตััะธะดะตัััะธัััััะฝะพะต ัะตััะธะดะตัััะธัััััะฝัะต ัะตััะธะดะตัััะธัััััะฝัั
+ัะตััะธะดะตัััะธัััััะฝัะผะธ
+
+ัะตะผะธะดะตัััะธัััััะฝัะน ัะตะผะธะดะตัััะธัััััะฝะพะณะพ ัะตะผะธะดะตัััะธัััััะฝะพะผั ัะตะผะธะดะตัััะธัััััะฝัะผ ัะตะผะธะดะตัััะธัััััะฝะพะผ ัะตะผะธะดะตัััะธัััััะฝะฐั
+ัะตะผะธะดะตัััะธัััััะฝะพะน ัะตะผะธะดะตัััะธัััััะฝัั ัะตะผะธะดะตัััะธัััััะฝะพั ัะตะผะธะดะตัััะธัััััะฝะพะต ัะตะผะธะดะตัััะธัััััะฝัะต ัะตะผะธะดะตัััะธัััััะฝัั
+ัะตะผะธะดะตัััะธัััััะฝัะผะธ
+
+ะฒะพััะผะธะดะตัััะธัััััะฝัะน ะฒะพััะผะธะดะตัััะธัััััะฝะพะณะพ ะฒะพััะผะธะดะตัััะธัััััะฝะพะผั ะฒะพััะผะธะดะตัััะธัััััะฝัะผ ะฒะพััะผะธะดะตัััะธัััััะฝะพะผ ะฒะพััะผะธะดะตัััะธัััััะฝะฐั
+ะฒะพััะผะธะดะตัััะธัััััะฝะพะน ะฒะพััะผะธะดะตัััะธัััััะฝัั ะฒะพััะผะธะดะตัััะธัััััะฝะพั ะฒะพััะผะธะดะตัััะธัััััะฝะพะต ะฒะพััะผะธะดะตัััะธัััััะฝัะต ะฒะพััะผะธะดะตัััะธัััััะฝัั
+ะฒะพััะผะธะดะตัััะธัััััะฝัะผะธ
+
+ััะพัััััะฝัะน ััะพัััััะฝะพะณะพ ััะพัััััะฝะพะผั ััะพัััััะฝัะผ ััะพัััััะฝะพะผ ััะพัััััะฝะฐั ััะพัััััะฝะพะน ััะพัััััะฝัั ััะพัััััะฝะพะต
+ััะพัััััะฝัะต ััะพัััััะฝัั ััะพัััััะฝัะผะธ ััะพัััััะฝะพั
+
+ะผะธะปะปะธะพะฝ ะผะธะปะปะธะพะฝะฝะพะณะพ ะผะธะปะปะธะพะฝะพะฒ ะผะธะปะปะธะพะฝะฝะพะผั ะผะธะปะปะธะพะฝะฝะพะผ ะผะธะปะปะธะพะฝะฝัะน ะผะธะปะปะธะพะฝะฝัะผ ะผะธะปะปะธะพะฝะพะผ ะผะธะปะปะธะพะฝะฐ ะผะธะปะปะธะพะฝะต ะผะธะปะปะธะพะฝั
+ะผะธะปะปะธะพะฝะพะฒ
+ะปัะผ ะปัะผะฐ ะปัะผั ะปัะผะพะผ ะปัะผะฐะผะธ ะปัะผะฐั ะปัะผะพะฒ
+ะผะปะฝ
+
+ะดะตัััะธะผะธะปะปะธะพะฝะฝะฐั ะดะตัััะธะผะธะปะปะธะพะฝะฝะพะน ะดะตัััะธะผะธะปะปะธะพะฝะฝัะผะธ ะดะตัััะธะผะธะปะปะธะพะฝะฝัะน ะดะตัััะธะผะธะปะปะธะพะฝะฝัะผ ะดะตัััะธะผะธะปะปะธะพะฝะฝะพะผั
+ะดะตัััะธะผะธะปะปะธะพะฝะฝัะผะธ ะดะตัััะธะผะธะปะปะธะพะฝะฝัั ะดะตัััะธะผะธะปะปะธะพะฝะฝะพะต ะดะตัััะธะผะธะปะปะธะพะฝะฝัะต ะดะตัััะธะผะธะปะปะธะพะฝะฝัั ะดะตัััะธะผะธะปะปะธะพะฝะฝะพั
+
+ะผะธะปะปะธะฐัะด ะผะธะปะปะธะฐัะดะฝะพะณะพ ะผะธะปะปะธะฐัะดะฝะพะผั ะผะธะปะปะธะฐัะดะฝะพะผ ะผะธะปะปะธะฐัะดะฝัะน ะผะธะปะปะธะฐัะดะฝัะผ ะผะธะปะปะธะฐัะดะพะผ ะผะธะปะปะธะฐัะดะฐ ะผะธะปะปะธะฐัะดะต ะผะธะปะปะธะฐัะดั
+ะผะธะปะปะธะฐัะดะพะฒ
+ะปััะด ะปััะดะฐ ะปััะดั ะปััะดะพะผ ะปััะดะฐะผะธ ะปััะดะฐั ะปััะดะพะฒ
+ะผะปัะด
+
+ััะธะปะปะธะพะฝ ััะธะปะปะธะพะฝะฝะพะณะพ ััะธะปะปะธะพะฝะฝะพะผั ััะธะปะปะธะพะฝะฝะพะผ ััะธะปะปะธะพะฝะฝัะน ััะธะปะปะธะพะฝะฝัะผ ััะธะปะปะธะพะฝะพะผ ััะธะปะปะธะพะฝะฐ ััะธะปะปะธะพะฝะต ััะธะปะปะธะพะฝั
+ััะธะปะปะธะพะฝะพะฒ ััะปะฝ
+
+ะบะฒะฐะดัะธะปะปะธะพะฝ ะบะฒะฐะดัะธะปะปะธะพะฝะฝะพะณะพ ะบะฒะฐะดัะธะปะปะธะพะฝะฝะพะผั ะบะฒะฐะดัะธะปะปะธะพะฝะฝัะน ะบะฒะฐะดัะธะปะปะธะพะฝะฝัะผ ะบะฒะฐะดัะธะปะปะธะพะฝะพะผ ะบะฒะฐะดัะธะปะปะธะพะฝะฐ ะบะฒะฐะดัะธะปะปะธะพะฝะต
+ะบะฒะฐะดัะธะปะปะธะพะฝั ะบะฒะฐะดัะธะปะปะธะพะฝะพะฒ ะบะฒะฐะดัะปะฝ
+
+ะบะฒะธะฝัะธะปะปะธะพะฝ ะบะฒะธะฝัะธะปะปะธะพะฝะฝะพะณะพ ะบะฒะธะฝัะธะปะปะธะพะฝะฝะพะผั ะบะฒะธะฝัะธะปะปะธะพะฝะฝัะน ะบะฒะธะฝัะธะปะปะธะพะฝะฝัะผ ะบะฒะธะฝัะธะปะปะธะพะฝะพะผ ะบะฒะธะฝัะธะปะปะธะพะฝะฐ ะบะฒะธะฝัะธะปะปะธะพะฝะต
+ะบะฒะธะฝัะธะปะปะธะพะฝั ะบะฒะธะฝัะธะปะปะธะพะฝะพะฒ ะบะฒะธะฝัะปะฝ
+
+i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi xvii xviii xix xx xxi xxii xxiii xxiv xxv xxvi xxvii xxviii xxix
+""".split()
+ )
+)
def like_num(text):
if text.startswith(("+", "-", "ยฑ", "~")):
text = text[1:]
+ if text.endswith("%"):
+ text = text[:-1]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
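
The `%` handling added to the Russian `like_num` above can be checked directly on the lexical attribute; the inputs are illustrative only:

    from spacy.lang.ru.lex_attrs import like_num

    assert like_num("10%")     # new: a trailing percent sign is stripped first
    assert like_num("10,5%")   # separators are still removed after that
    assert like_num("1/2")     # simple fractions, as before
    assert not like_num("%")   # a bare percent sign is not number-like
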
diff --git a/spacy/lang/ru/stop_words.py b/spacy/lang/ru/stop_words.py
index 16cb55ef9..d6ea6b42a 100644
--- a/spacy/lang/ru/stop_words.py
+++ b/spacy/lang/ru/stop_words.py
@@ -1,52 +1,111 @@
STOP_WORDS = set(
"""
-ะฐ
+ะฐ ะฐะฒะพัั ะฐะณะฐ ะฐะณั ะฐะถ ะฐะน ะฐะปะธ ะฐะปะปะพ ะฐั ะฐั ะฐั
-ะฑัะดะตะผ ะฑัะดะตั ะฑัะดะตัะต ะฑัะดะตัั ะฑัะดั ะฑัะดัั ะฑัะดััะธ ะฑัะดั ะฑัะดััะต ะฑั ะฑัะป ะฑัะปะฐ ะฑัะปะธ ะฑัะปะพ
-ะฑััั
+ะฑ ะฑัะดะตะผ ะฑัะดะตั ะฑัะดะตัะต ะฑัะดะตัั ะฑัะดั ะฑัะดัั ะฑัะดััะธ ะฑัะดั ะฑัะดััะต ะฑั ะฑัะป ะฑัะปะฐ ะฑัะปะธ ะฑัะปะพ
+ะฑััั ะฑะฐั ะฑะตะท ะฑะตะทััะปะพะฒะฝะพ ะฑะธัั ะฑะปะฐะณะพ ะฑะปะฐะณะพะดะฐัั ะฑะปะธะถะฐะนัะธะต ะฑะปะธะทะบะพ ะฑะพะปะตะต ะฑะพะปััะต
+ะฑัะดัะพ ะฑัะฒะฐะตั ะฑัะฒะฐะปะฐ ะฑัะฒะฐะปะธ ะฑัะฒะฐั ะฑัะฒะฐัั ะฑัััะตั
ะฒ ะฒะฐะผ ะฒะฐะผะธ ะฒะฐั ะฒะตัั ะฒะพ ะฒะพั ะฒัะต ะฒัั ะฒัะตะณะพ ะฒัะตะน ะฒัะตะผ ะฒััะผ ะฒัะตะผะธ ะฒัะตะผั ะฒัะตั ะฒัะตั
-ะฒัะตั ะฒัั ะฒัั ะฒั
+ะฒัะตั ะฒัั ะฒัั ะฒั ะฒะฐั ะฒะฐัะฐ ะฒะฐัะต ะฒะฐัะธ ะฒะดะฐะปะธ ะฒะดะพะฑะฐะฒะพะบ ะฒะดััะณ ะฒะตะดั ะฒะตะทะดะต ะฒะตัะฝะตะต
+ะฒะทะฐะธะผะฝะพ ะฒะทะฐะฟัะฐะฒะดั ะฒะธะดะฝะพ ะฒะธัั ะฒะบะปััะฐั ะฒะผะตััะพ ะฒะฝะฐะบะปะฐะดะต ะฒะฝะฐัะฐะปะต ะฒะฝะต ะฒะฝะธะท ะฒะฝะธะทั
+ะฒะฝะพะฒั ะฒะพะฒัะต ะฒะพะทะผะพะถะฝะพ ะฒะพะธััะธะฝั ะฒะพะบััะณ ะฒะพะฝ ะฒะพะพะฑัะต ะฒะพะฟัะตะบะธ ะฒะฟะตัะตะบะพั ะฒะฟะปะพัั
+ะฒะฟะพะปะฝะต ะฒะฟัะฐะฒะดั ะฒะฟัะฐะฒะต ะฒะฟัะพัะตะผ ะฒะฟััะผั ะฒัะตัะฝะพัั ะฒัะพะดะต ะฒััะด ะฒัะตะณะดะฐ ะฒััะดั
+ะฒััะบะธะน ะฒััะบะพะณะพ ะฒััะบะพะน ะฒัััะตัะบะธ ะฒัะตัะตะด
-ะดะฐ ะดะปั ะดะพ
+ะณ ะณะพ ะณะดะต ะณะพัะฐะทะดะพ ะณะฐะฒ
-ะตะณะพ ะตะดะธะผ ะตะดัั ะตะต ะตั ะตะน ะตะป ะตะปะฐ ะตะผ ะตะผั ะตะผั ะตัะปะธ ะตัั ะตััั ะตัั ะตัะต ะตัั ะตั
+ะด ะดะฐ ะดะปั ะดะพ ะดะฐะฑั ะดะฐะฒะฐะนัะต ะดะฐะฒะฝะพ ะดะฐะฒะฝัะผ ะดะฐะถะต ะดะฐะปะตะต ะดะฐะปะตะบะพ ะดะฐะปััะต ะดะฐะฝะฝะฐั
+ะดะฐะฝะฝะพะณะพ ะดะฐะฝะฝะพะต ะดะฐะฝะฝะพะน ะดะฐะฝะฝะพะผ ะดะฐะฝะฝะพะผั ะดะฐะฝะฝัะต ะดะฐะฝะฝัะน ะดะฐะฝะฝัั ะดะฐะฝั ะดะฐะฝัะฝะฐั
+ะดะฐัะพะผ ะดะต ะดะตะนััะฒะธัะตะปัะฝะพ ะดะพะฒะพะปัะฝะพ ะดะพะบะพะปะต ะดะพะบะพะปั ะดะพะปะณะพ ะดะพะปะถะตะฝ ะดะพะปะถะฝะฐ
+ะดะพะปะถะฝะพ ะดะพะปะถะฝั ะดะพะปะถะฝัะน ะดะพะฟะพะปะฝะธัะตะปัะฝะพ ะดััะณะฐั ะดััะณะธะต ะดััะณะธะผ ะดััะณะธะผะธ
+ะดััะณะธั ะดััะณะพะต ะดััะณะพะน
-ะถะต
+ะต ะตะณะพ ะตะดะธะผ ะตะดัั ะตะต ะตั ะตะน ะตะป ะตะปะฐ ะตะผ ะตะผั ะตะผั ะตัะปะธ ะตัั ะตััั ะตัั ะตัะต ะตัั ะตั ะตะดะฒะฐ
+ะตะถะตะปะธ ะตะปะต
-ะทะฐ
+ะถ ะถะต
-ะธ ะธะท ะธะปะธ ะธะผ ะธะผะธ ะธะผั ะธั
+ะท ะทะฐ ะทะฐัะตะผ ะทะฐัะพ ะทะฐัะตะผ ะทะดะตัั ะทะฝะฐัะธั ะทัั
+
+ะธ ะธะท ะธะปะธ ะธะผ ะธะผะธ ะธะผั ะธั ะธะฑะพ ะธะปั ะธะผะตะตั ะธะผะตะป ะธะผะตะปะฐ ะธะผะตะปะพ ะธะผะตะฝะฝะพ ะธะผะตัั ะธะฝะฐัะต
+ะธะฝะพะณะดะฐ ะธะฝัะผ ะธะฝัะผะธ ะธัะฐะบ ะธัั
+
+ะน
ะบ ะบะฐะบ ะบะตะผ ะบะพ ะบะพะณะดะฐ ะบะพะณะพ ะบะพะผ ะบะพะผั ะบะพะผัั ะบะพัะพัะฐั ะบะพัะพัะพะณะพ ะบะพัะพัะพะต ะบะพัะพัะพะน ะบะพัะพัะพะผ
-ะบะพัะพัะพะผั ะบะพัะพัะพั ะบะพัะพััั ะบะพัะพััะต ะบะพัะพััะน ะบะพัะพััะผ ะบะพัะพััะผะธ ะบะพัะพััั ะบัะพ
+ะบะพัะพัะพะผั ะบะพัะพัะพั ะบะพัะพััั ะบะพัะพััะต ะบะพัะพััะน ะบะพัะพััะผ ะบะพัะพััะผะธ ะบะพัะพััั ะบัะพ ะบะฐ ะบะฐะฑั
+ะบะฐะถะดะฐั ะบะฐะถะดะพะต ะบะฐะถะดัะต ะบะฐะถะดัะน ะบะฐะถะตััั ะบะฐะทะฐะปะฐัั ะบะฐะทะฐะปะธัั ะบะฐะทะฐะปะพัั ะบะฐะทะฐะปัั ะบะฐะทะฐัััั
+ะบะฐะบะฐั ะบะฐะบะธะต ะบะฐะบะธะผ ะบะฐะบะธะผะธ ะบะฐะบะพะฒ ะบะฐะบะพะณะพ ะบะฐะบะพะน ะบะฐะบะพะผั ะบะฐะบะพั ะบะฐัะฐัะตะปัะฝะพ ะบะพะน ะบะพะปะธ
+ะบะพะปั ะบะพะฝะตัะฝะพ ะบะพัะพัะต ะบัะพะผะต ะบััะฐัะธ ะบั ะบัะดะฐ
-ะผะตะฝั ะผะฝะต ะผะฝะพะน ะผะฝะพั ะผะพะณ ะผะพะณะธ ะผะพะณะธัะต ะผะพะณะปะฐ ะผะพะณะปะธ ะผะพะณะปะพ ะผะพะณั ะผะพะณัั ะผะพะต ะผะพั ะผะพะตะณะพ
+ะป ะปะธ ะปะธะฑะพ ะปะธัั ะปัะฑะฐั ะปัะฑะพะณะพ ะปัะฑะพะต ะปัะฑะพะน ะปัะฑะพะผ ะปัะฑัั ะปัะฑัะผะธ ะปัะฑัั
+
+ะผ ะผะตะฝั ะผะฝะต ะผะฝะพะน ะผะฝะพั ะผะพะณ ะผะพะณะธ ะผะพะณะธัะต ะผะพะณะปะฐ ะผะพะณะปะธ ะผะพะณะปะพ ะผะพะณั ะผะพะณัั ะผะพะต ะผะพั ะผะพะตะณะพ
ะผะพะตะน ะผะพะตะผ ะผะพัะผ ะผะพะตะผั ะผะพะตั ะผะพะถะตะผ ะผะพะถะตั ะผะพะถะตัะต ะผะพะถะตัั ะผะพะธ ะผะพะน ะผะพะธะผ ะผะพะธะผะธ ะผะพะธั
-ะผะพัั ะผะพั ะผะพั ะผั
+ะผะพัั ะผะพั ะผะพั ะผั ะผะฐะปะพ ะผะตะถ ะผะตะถะดั ะผะตะฝะตะต ะผะตะฝััะต ะผะธะผะพ ะผะฝะพะณะธะต ะผะฝะพะณะพ ะผะฝะพะณะพะณะพ ะผะฝะพะณะพะต
+ะผะฝะพะณะพะผ ะผะฝะพะณะพะผั ะผะพะถะฝะพ ะผะพะป ะผั
-ะฝะฐ ะฝะฐะผ ะฝะฐะผะธ ะฝะฐั ะฝะฐัะฐ ะฝะฐั ะฝะฐัะฐ ะฝะฐัะต ะฝะฐัะตะณะพ ะฝะฐัะตะน ะฝะฐัะตะผ ะฝะฐัะตะผั ะฝะฐัะตั ะฝะฐัะธ ะฝะฐัะธะผ
+ะฝ ะฝะฐ ะฝะฐะผ ะฝะฐะผะธ ะฝะฐั ะฝะฐัะฐ ะฝะฐั ะฝะฐัะฐ ะฝะฐัะต ะฝะฐัะตะณะพ ะฝะฐัะตะน ะฝะฐัะตะผ ะฝะฐัะตะผั ะฝะฐัะตั ะฝะฐัะธ ะฝะฐัะธะผ
ะฝะฐัะธะผะธ ะฝะฐัะธั ะฝะฐัั ะฝะต ะฝะตะณะพ ะฝะตะต ะฝะตั ะฝะตะน ะฝะตะผ ะฝัะผ ะฝะตะผั ะฝะตั ะฝะตั ะฝะธะผ ะฝะธะผะธ ะฝะธั ะฝะพ
+ะฝะฐะฒะตัะฝัะบะฐ ะฝะฐะฒะตัั ั ะฝะฐะฒััะด ะฝะฐะฒัะฒะพัะพั ะฝะฐะด ะฝะฐะดะพ ะฝะฐะทะฐะด ะฝะฐะธะฑะพะปะตะต ะฝะฐะธะทะฒะพัะพั
+ะฝะฐะธะทะฝะฐะฝะบั ะฝะฐะธะฟะฐัะต ะฝะฐะบะฐะฝัะฝะต ะฝะฐะบะพะฝะตั ะฝะฐะพะฑะพัะพั ะฝะฐะฟะตัะตะด ะฝะฐะฟะตัะตะบะพั ะฝะฐะฟะพะดะพะฑะธะต
+ะฝะฐะฟัะธะผะตั ะฝะฐะฟัะพัะธะฒ ะฝะฐะฟััะผัั ะฝะฐัะธะปั ะฝะฐััะพััะฐั ะฝะฐััะพััะตะต ะฝะฐััะพััะธะต ะฝะฐััะพััะธะน
+ะฝะฐััะตั ะฝะฐัะต ะฝะฐั ะพะดะธัััั ะฝะฐัะฐะปะฐ ะฝะฐัะฐะปะต ะฝะตะฒะฐะถะฝะพ ะฝะตะณะดะต ะฝะตะดะฐะฒะฝะพ ะฝะตะดะฐะปะตะบะพ ะฝะตะทะฐัะตะผ
+ะฝะตะบะตะผ ะฝะตะบะพะณะดะฐ ะฝะตะบะพะผั ะฝะตะบะพัะพัะฐั ะฝะตะบะพัะพััะต ะฝะตะบะพัะพััะน ะฝะตะบะพัะพััั ะฝะตะบัะพ ะฝะตะบัะดะฐ
+ะฝะตะปัะทั ะฝะตะผะฝะพะณะธะต ะฝะตะผะฝะพะณะธะผ ะฝะตะผะฝะพะณะพ ะฝะตะพะฑั ะพะดะธะผะพ ะฝะตะพะฑั ะพะดะธะผะพััะธ ะฝะตะพะฑั ะพะดะธะผัะต
+ะฝะตะพะฑั ะพะดะธะผัะผ ะฝะตะพัะบัะดะฐ ะฝะตะฟัะตััะฒะฝะพ ะฝะตัะตะดะบะพ ะฝะตัะบะพะปัะบะพ ะฝะตัั ะฝะตัะถะตะปะธ ะฝะตัะตะณะพ
+ะฝะตัะตะผ ะฝะตัะตะผั ะฝะตััะพ ะฝะตััะพ ะฝะธะฑัะดั ะฝะธะณะดะต ะฝะธะถะต ะฝะธะทะบะพ ะฝะธะบะฐะบ ะฝะธะบะฐะบะพะน ะฝะธะบะตะผ
+ะฝะธะบะพะณะดะฐ ะฝะธะบะพะณะพ ะฝะธะบะพะผั ะฝะธะบัะพ ะฝะธะบัะดะฐ ะฝะธะพัะบัะดะฐ ะฝะธะฟะพัะตะผ ะฝะธัะตะณะพ ะฝะธัะตะผ ะฝะธัะตะผั
+ะฝะธััะพ ะฝั ะฝัะถะฝะฐั ะฝัะถะฝะพ ะฝัะถะฝะพะณะพ ะฝัะถะฝัะต ะฝัะถะฝัะน ะฝัะถะฝัั ะฝัะฝะต ะฝัะฝะตัะฝะตะต ะฝัะฝะตัะฝะตะน
+ะฝัะฝะตัะฝะธั ะฝัะฝัะต
ะพ ะพะฑ ะพะดะธะฝ ะพะดะฝะฐ ะพะดะฝะธ ะพะดะฝะธะผ ะพะดะฝะธะผะธ ะพะดะฝะธั ะพะดะฝะพ ะพะดะฝะพะณะพ ะพะดะฝะพะน ะพะดะฝะพะผ ะพะดะฝะพะผั ะพะดะฝะพั
-ะพะดะฝั ะพะฝ ะพะฝะฐ ะพะฝะต ะพะฝะธ ะพะฝะพ ะพั
+ะพะดะฝั ะพะฝ ะพะฝะฐ ะพะฝะต ะพะฝะธ ะพะฝะพ ะพั ะพะฑะฐ ะพะฑััั ะพะฑััะฝะพ ะพะณะพ ะพะดะฝะฐะถะดั ะพะดะฝะฐะบะพ ะพะน ะพะบะพะปะพ ะพะฝัะน
+ะพะฟ ะพะฟััั ะพัะพะฑะตะฝะฝะพ ะพัะพะฑะพ ะพัะพะฑัั ะพัะพะฑัะต ะพัะบัะดะฐ ะพัะฝะตะปะธะถะฐ ะพัะฝะตะปะธะถะต ะพัะพะฒััะดั
+ะพัััะดะฐ ะพััะพะณะพ ะพััะพั ะพัััะดะฐ ะพััะตะณะพ ะพััะตะผั ะพั ะพัะตะฒะธะดะฝะพ ะพัะตะฝั ะพะผ
-ะฟะพ ะฟัะธ
+ะฟ ะฟะพ ะฟัะธ ะฟะฐัะต ะฟะตัะตะด ะฟะพะด ะฟะพะดะฐะฒะฝะพ ะฟะพะดะธ ะฟะพะดะพะฑะฝะฐั ะฟะพะดะพะฑะฝะพ ะฟะพะดะพะฑะฝะพะณะพ ะฟะพะดะพะฑะฝัะต
+ะฟะพะดะพะฑะฝัะน ะฟะพะดะพะฑะฝัะผ ะฟะพะดะพะฑะฝัั ะฟะพะตะปะธะบั ะฟะพะถะฐะปัะน ะฟะพะถะฐะปัะนััะฐ ะฟะพะทะถะต ะฟะพะธััะธะฝะต
+ะฟะพะบะฐ ะฟะพะบะฐะผะตัั ะฟะพะบะพะปะต ะฟะพะบะพะปั ะฟะพะบัะดะฐ ะฟะพะบัะดะพะฒะฐ ะฟะพะผะธะผะพ ะฟะพะฝะตะถะต ะฟะพะฟัะธัะต ะฟะพั
+ะฟะพัะฐ ะฟะพัะตะผั ะฟะพัะบะพะปัะบั ะฟะพัะปะต ะฟะพััะตะดะธ ะฟะพััะตะดััะฒะพะผ ะฟะพัะพะผ ะฟะพัะพะผั ะฟะพัะพะผัััะฐ
+ะฟะพั ะพะถะตะผ ะฟะพัะตะผั ะฟะพััะธ ะฟะพััะพะผั ะฟัะตะถะดะต ะฟัะธัะพะผ ะฟัะธัะตะผ ะฟัะพ ะฟัะพััะพ ะฟัะพัะตะณะพ
+ะฟัะพัะตะต ะฟัะพัะตะผั ะฟัะพัะธะผะธ ะฟัะพัะต ะฟััะผ ะฟัััั
+
+ั ัะฐะดะธ ัะฐะทะฒะต ัะฐะฝะตะต ัะฐะฝะพ ัะฐะฝััะต ััะดะพะผ
ั ัะฐะผ ัะฐะผะฐ ัะฐะผะธ ัะฐะผะธะผ ัะฐะผะธะผะธ ัะฐะผะธั ัะฐะผะพ ัะฐะผะพะณะพ ัะฐะผะพะผ ัะฐะผะพะผั ัะฐะผั ัะฒะพะต ัะฒะพั
ัะฒะพะตะณะพ ัะฒะพะตะน ัะฒะพะตะผ ัะฒะพัะผ ัะฒะพะตะผั ัะฒะพะตั ัะฒะพะธ ัะฒะพะน ัะฒะพะธะผ ัะฒะพะธะผะธ ัะฒะพะธั ัะฒะพั ัะฒะพั
-ัะตะฑะต ัะตะฑั ัะพะฑะพะน ัะพะฑะพั
+ัะตะฑะต ัะตะฑั ัะพะฑะพะน ัะพะฑะพั ัะฐะผะฐั ัะฐะผะพะต ัะฐะผะพะน ัะฐะผัะน ัะฐะผัั ัะฒะตัั ัะฒััะต ัะต ัะตะณะพ ัะตะน
+ัะตะนัะฐั ัะธะต ัะธั ัะบะฒะพะทั ัะบะพะปัะบะพ ัะบะพัะตะต ัะบะพัะพ ัะปะตะดัะตั ัะปะธัะบะพะผ ัะผะพะณัั ัะผะพะถะตั
+ัะฝะฐัะฐะปะฐ ัะฝะพะฒะฐ ัะพ ัะพะฑััะฒะตะฝะฝะพ ัะพะฒัะตะผ ัะฟะตัะฒะฐ ัะฟะพะบะพะฝั ัะฟัััั ััะฐะทั ััะตะดะธ ััะพะดะฝะธ
+ััะฐะป ััะฐะปะฐ ััะฐะปะธ ััะฐะปะพ ััะฐัั ัััั ััะทะฝะพะฒะฐ
-ัะฐ ัะฐะบ ัะฐะบะฐั ัะฐะบะธะต ัะฐะบะธะผ ัะฐะบะธะผะธ ัะฐะบะธั ัะฐะบะพะณะพ ัะฐะบะพะต ัะฐะบะพะน ัะฐะบะพะผ ัะฐะบะพะผั ัะฐะบะพั
-ัะฐะบัั ัะต ัะตะฑะต ัะตะฑั ัะตะผ ัะตะผะธ ัะตั ัะพ ัะพะฑะพะน ัะพะฑะพั ัะพะณะพ ัะพะน ัะพะปัะบะพ ัะพะผ ัะพะผะฐั ัะพะผั
-ัะพั ัะพั ัั ัั
+ัะฐ ัะพ ัั ัั ัะธ ัะฐะบ ัะฐะบะฐั ัะฐะบะธะต ัะฐะบะธะผ ัะฐะบะธะผะธ ัะฐะบะธั ัะฐะบะพะณะพ ัะฐะบะพะต ัะฐะบะพะน ัะฐะบะพะผ ัะฐะบะพะผั ัะฐะบะพั
+ัะฐะบัั ัะต ัะตะฑะต ัะตะฑั ัะตะผ ัะตะผะธ ัะตั ัะพะฑะพะน ัะพะฑะพั ัะพะณะพ ัะพะน ัะพะปัะบะพ ัะพะผ ัะพะผะฐั ัะพะผั
+ัะพั ัะพั ัะฐะบะถะต ัะฐะบะธ ัะฐะบะพะฒ ัะฐะบะพะฒะฐ ัะฐะผ ัะฒะพะธ ัะฒะพะธะผ ัะฒะพะธั ัะฒะพะน ัะฒะพั ัะฒะพั
+ัะตะฟะตัั ัะพะณะดะฐ ัะพะถะต ัะพััะฐั ัะพัะฝะพ ััะดะฐ ััั ัััั ัะฐั
-ั ัะถะต
+ั ัะถะต ัะฒั ัะถ ััะฐ ัั ัั
-ัะตะณะพ ัะตะผ ััะผ ัะตะผั ััะพ ััะพะฑั
+ั ัั
-ััะฐ ััะธ ััะธะผ ััะธะผะธ ััะธั ััะพ ััะพะณะพ ััะพะน ััะพะผ ััะพะผั ััะพั ััะพั ััั
+ั ั ะฐ ั ะต ั ะพัะพัะพ ั ะพัะตะป ั ะพัะตะปะฐ ั ะพัะตะปะพัั ั ะพัะตัั ั ะพัั ั ะพัั ั ะพัะตัั ั ะพัั ั ัะถะต
-ั
+ั ัะตะณะพ ัะตะผ ััะผ ัะตะผั ััะพ ััะพะฑั ัะฐััะพ ัะฐัะต ัะตะน ัะตัะตะท ััะพะฑ ัััั ัั ะฐัั ััะธะผ
+ััะธั ััั ัั
+
+ั ัะฐ
+
+ั ัะฐ ัะฐั
+
+ั ัั ัะต ัะน
+
+ั ััะฐ ััะธ ััะธะผ ััะธะผะธ ััะธั ััะพ ััะพะณะพ ััะพะน ััะพะผ ััะพะผั ััะพั ััะพั ััั ัะดะฐะบ ัะดะฐะบะธะน
+ัะน ัะบะฐ ัะบะธะน ััะฐะบ ััะฐะบะธะน ัั
+
+ั
+
+ั ัะฒะฝะพ ัะฒะฝัั ัะบะพ ัะบะพะฑั ัะบะพะถะต
""".split()
)
diff --git a/spacy/lang/ru/tokenizer_exceptions.py b/spacy/lang/ru/tokenizer_exceptions.py
index 1dc363fae..f3756e26c 100644
--- a/spacy/lang/ru/tokenizer_exceptions.py
+++ b/spacy/lang/ru/tokenizer_exceptions.py
@@ -2,7 +2,6 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, NORM
from ...util import update_exc
-
_exc = {}
_abbrev_exc = [
@@ -42,7 +41,6 @@ _abbrev_exc = [
{ORTH: "ะดะตะบ", NORM: "ะดะตะบะฐะฑัั"},
]
-
for abbrev_desc in _abbrev_exc:
abbrev = abbrev_desc[ORTH]
for orth in (abbrev, abbrev.capitalize(), abbrev.upper()):
@@ -50,17 +48,354 @@ for abbrev_desc in _abbrev_exc:
_exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}]
-_slang_exc = [
+for abbr in [
+ # Year slang abbreviations
{ORTH: "2ะบ15", NORM: "2015"},
{ORTH: "2ะบ16", NORM: "2016"},
{ORTH: "2ะบ17", NORM: "2017"},
{ORTH: "2ะบ18", NORM: "2018"},
{ORTH: "2ะบ19", NORM: "2019"},
{ORTH: "2ะบ20", NORM: "2020"},
-]
+ {ORTH: "2ะบ21", NORM: "2021"},
+ {ORTH: "2ะบ22", NORM: "2022"},
+ {ORTH: "2ะบ23", NORM: "2023"},
+ {ORTH: "2ะบ24", NORM: "2024"},
+ {ORTH: "2ะบ25", NORM: "2025"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
-for slang_desc in _slang_exc:
- _exc[slang_desc[ORTH]] = [slang_desc]
+for abbr in [
+ # Abbreviations for professions and academic titles
+ {ORTH: "ะฐะบ.", NORM: "ะฐะบะฐะดะตะผะธะบ"},
+ {ORTH: "ะฐะบะฐะด.", NORM: "ะฐะบะฐะดะตะผะธะบ"},
+ {ORTH: "ะด-ั ะฐัั ะธัะตะบัััั", NORM: "ะดะพะบัะพั ะฐัั ะธัะตะบัััั"},
+ {ORTH: "ะด-ั ะฑะธะพะป. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฑะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะฒะตัะตัะธะฝะฐั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฒะตัะตัะธะฝะฐัะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะฒะพะตะฝ. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฒะพะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะณะตะพะณั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะณะตะพะป.-ะผะธะฝะตัะฐะป. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะณะตะพะปะพะณะพ-ะผะธะฝะตัะฐะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะธัะบััััะฒะพะฒะตะดะตะฝะธั", NORM: "ะดะพะบัะพั ะธัะบััััะฒะพะฒะตะดะตะฝะธั"},
+ {ORTH: "ะด-ั ะธัั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะธััะพัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะบัะปััััะพะปะพะณะธะธ", NORM: "ะดะพะบัะพั ะบัะปััััะพะปะพะณะธะธ"},
+ {ORTH: "ะด-ั ะผะตะด. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะผะตะดะธัะธะฝัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะฟะตะด. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฟะตะดะฐะณะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะฟะพะปะธั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฟะพะปะธัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ะฟัะธั ะพะป. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ะฟัะธั ะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ั.-ั . ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะตะปััะบะพั ะพะทัะนััะฒะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะพัะธะพะป. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะตั ะฝ. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะตั ะฝะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะฐัะผะฐัะตะฒั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะฐัะผะฐัะตะฒัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะธะท.-ะผะฐั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะธะทะธะบะพ-ะผะฐัะตะผะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะธะปะพะป. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะธะปะพั. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ั ะธะผ. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ั ะธะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ัะบะพะฝ. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั ััะธะด. ะฝะฐัะบ", NORM: "ะดะพะบัะพั ััะธะดะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด-ั", NORM: "ะดะพะบัะพั"},
+ {ORTH: "ะด.ะฑ.ะฝ.", NORM: "ะดะพะบัะพั ะฑะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะณ.-ะผ.ะฝ.", NORM: "ะดะพะบัะพั ะณะตะพะปะพะณะพ-ะผะธะฝะตัะฐะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะณ.ะฝ.", NORM: "ะดะพะบัะพั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะธ.ะฝ.", NORM: "ะดะพะบัะพั ะธััะพัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะธัะบ.", NORM: "ะดะพะบัะพั ะธัะบััััะฒะพะฒะตะดะตะฝะธั"},
+ {ORTH: "ะด.ะผ.ะฝ.", NORM: "ะดะพะบัะพั ะผะตะดะธัะธะฝัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะฟ.ะฝ.", NORM: "ะดะพะบัะพั ะฟัะธั ะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะฟะตะด.ะฝ.", NORM: "ะดะพะบัะพั ะฟะตะดะฐะณะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ะฟะพะปะธั.ะฝ.", NORM: "ะดะพะบัะพั ะฟะพะปะธัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.-ั .ะฝ.", NORM: "ะดะพะบัะพั ัะตะปััะบะพั ะพะทัะนััะฒะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะด.ัะพัะธะพะป.ะฝ.", NORM: "ะดะพะบัะพั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ.", NORM: "ะดะพะบัะพั ัะตั ะฝะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ", NORM: "ะดะพะบัะพั ัะตั ะฝะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.-ะผ.ะฝ.", NORM: "ะดะพะบัะพั ัะธะทะธะบะพ-ะผะฐัะตะผะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ.", NORM: "ะดะพะบัะพั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ัะธะปะพั.ะฝ.", NORM: "ะดะพะบัะพั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ัะธะป.ะฝ.", NORM: "ะดะพะบัะพั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั .ะฝ.", NORM: "ะดะพะบัะพั ั ะธะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ.", NORM: "ะดะพะบัะพั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ", NORM: "ะดะพะบัะพั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะด.ั.ะฝ.", NORM: "ะดะพะบัะพั ััะธะดะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะดะพั.", NORM: "ะดะพัะตะฝั"},
+ {ORTH: "ะธ.ะพ.", NORM: "ะธัะฟะพะปะฝัััะธะน ะพะฑัะทะฐะฝะฝะพััะธ"},
+ {ORTH: "ะบ.ะฑ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฑะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะฒะพะตะฝ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฒะพะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะณ.-ะผ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะปะพะณะพ-ะผะธะฝะตัะฐะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะณ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะณะตะพะณั.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะณะตะพะณั.ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะธ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะธััะพัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะธัะบ.", NORM: "ะบะฐะฝะดะธะดะฐั ะธัะบััััะฒะพะฒะตะดะตะฝะธั"},
+ {ORTH: "ะบ.ะผ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะผะตะดะธัะธะฝัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะฟ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฟัะธั ะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะฟัั .ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฟัะธั ะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะฟะตะด.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฟะตะดะฐะณะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด.ะฟะตะด.ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฟะตะดะฐะณะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะฟะพะปะธั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะฟะพะปะธัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.-ั .ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะตะปััะบะพั ะพะทัะนััะฒะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะพัะธะพะป.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะตั ะฝะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.-ะผ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะทะธะบะพ-ะผะฐัะตะผะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะธะป.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะธะปะพะป.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะฐัะผ.ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะฐัะผะฐะบะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะฐัะผ.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะฐัะผะฐะบะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะฐัะผ.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ัะฐัะผะฐะบะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะธะปะพั.ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะธะปะพั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะธะปะพั.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั .ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ั ะธะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั .ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ั ะธะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ััะธะดะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ั.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ััะธะดะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฐัั ะธัะตะบัััั", NORM: "ะบะฐะฝะดะธะดะฐั ะฐัั ะธัะตะบัััั"},
+ {ORTH: "ะบะฐะฝะด. ะฑะธะพะป. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฑะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฒะตัะตัะธะฝะฐั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฒะตัะตัะธะฝะฐัะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฒะพะตะฝ. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฒะพะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะณะตะพะณั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะณัะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะณะตะพะป.-ะผะธะฝะตัะฐะป. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะณะตะพะปะพะณะพ-ะผะธะฝะตัะฐะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะธัะบััััะฒะพะฒะตะดะตะฝะธั", NORM: "ะบะฐะฝะดะธะดะฐั ะธัะบััััะฒะพะฒะตะดะตะฝะธั"},
+ {ORTH: "ะบะฐะฝะด. ะธัั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะธััะพัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ะธัั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ะธััะพัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะบัะปััััะพะปะพะณะธะธ", NORM: "ะบะฐะฝะดะธะดะฐั ะบัะปััััะพะปะพะณะธะธ"},
+ {ORTH: "ะบะฐะฝะด. ะผะตะด. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะผะตะดะธัะธะฝัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฟะตะด. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฟะตะดะฐะณะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฟะพะปะธั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฟะพะปะธัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ะฟัะธั ะพะป. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ะฟัะธั ะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ั.-ั . ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะตะปััะบะพั ะพะทัะนััะฒะตะฝะฝัั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะพัะธะพะป. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะพั.ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะพั.ะฝ.", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบ.ัะพั.ะฝ", NORM: "ะบะฐะฝะดะธะดะฐั ัะพัะธะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะตั ะฝ. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะตั ะฝะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะฐัะผะฐัะตะฒั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะฐัะผะฐัะตะฒัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะธะท.-ะผะฐั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะทะธะบะพ-ะผะฐัะตะผะฐัะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะธะปะพะป. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพะปะพะณะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะธะปะพั. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะธะปะพัะพััะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ั ะธะผ. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ั ะธะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ัะบะพะฝ. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ัะบะพะฝะพะผะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะบะฐะฝะด. ััะธะด. ะฝะฐัะบ", NORM: "ะบะฐะฝะดะธะดะฐั ััะธะดะธัะตัะบะธั ะฝะฐัะบ"},
+ {ORTH: "ะฒ.ะฝ.ั.", NORM: "ะฒะตะดััะธะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะผะป. ะฝะฐัั. ัะพัั.", NORM: "ะผะปะฐะดัะธะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะผ.ะฝ.ั.", NORM: "ะผะปะฐะดัะธะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะฟัะพั.", NORM: "ะฟัะพัะตััะพั"},
+ {ORTH: "ะฟัะพัะตััะพั.ะบะฐัะตะดัั", NORM: "ะฟัะพัะตััะพั ะบะฐัะตะดัั"},
+ {ORTH: "ัั. ะฝะฐัั. ัะพัั.", NORM: "ััะฐััะธะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ัะป.-ะบ.", NORM: "ัะปะตะฝ ะบะพััะตัะฟะพะฝะดะตะฝั"},
+ {ORTH: "ัะป.-ะบะพัั.", NORM: "ัะปะตะฝ-ะบะพััะตัะฟะพะฝะดะตะฝั"},
+ {ORTH: "ัะป.-ะบะพั.", NORM: "ัะปะตะฝ-ะบะพััะตัะฟะพะฝะดะตะฝั"},
+ {ORTH: "ะดะธั.", NORM: "ะดะธัะตะบัะพั"},
+ {ORTH: "ะทะฐะผ. ะดะธั.", NORM: "ะทะฐะผะตััะธัะตะปั ะดะธัะตะบัะพัะฐ"},
+ {ORTH: "ะทะฐะฒ. ะบะฐั.", NORM: "ะทะฐะฒะตะดัััะธะน ะบะฐัะตะดัะพะน"},
+ {ORTH: "ะทะฐะฒ.ะบะฐัะตะดัะพะน", NORM: "ะทะฐะฒะตะดัััะธะน ะบะฐัะตะดัะพะน"},
+ {ORTH: "ะทะฐะฒ. ะบะฐัะตะดัะพะน", NORM: "ะทะฐะฒะตะดัััะธะน ะบะฐัะตะดัะพะน"},
+ {ORTH: "ะฐัะฟ.", NORM: "ะฐัะฟะธัะฐะฝั"},
+ {ORTH: "ะณะป. ะฝะฐัั. ัะพัั.", NORM: "ะณะปะฐะฒะฝัะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะฒะตะด. ะฝะฐัั. ัะพัั.", NORM: "ะฒะตะดััะธะน ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะฝะฐัั. ัะพัั.", NORM: "ะฝะฐััะฝัะน ัะพัััะดะฝะธะบ"},
+ {ORTH: "ะบ.ะผ.ั.", NORM: "ะบะฐะฝะดะธะดะฐั ะฒ ะผะฐััะตัะฐ ัะฟะพััะฐ"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
+
+
+for abbr in [
+ # Abbreviations of literary phrases
+ {ORTH: "ะธ ั.ะด.", NORM: "ะธ ัะฐะบ ะดะฐะปะตะต"},
+ {ORTH: "ะธ ั.ะฟ.", NORM: "ะธ ัะพะผั ะฟะพะดะพะฑะฝะพะต"},
+ {ORTH: "ั.ะด.", NORM: "ัะฐะบ ะดะฐะปะตะต"},
+ {ORTH: "ั.ะฟ.", NORM: "ัะพะผั ะฟะพะดะพะฑะฝะพะต"},
+ {ORTH: "ั.ะต.", NORM: "ัะพ ะตััั"},
+ {ORTH: "ั.ะบ.", NORM: "ัะฐะบ ะบะฐะบ"},
+ {ORTH: "ะฒ ั.ั.", NORM: "ะฒ ัะพะผ ัะธัะปะต"},
+ {ORTH: "ะธ ะฟั.", NORM: "ะธ ะฟัะพัะธะต"},
+ {ORTH: "ะธ ะดั.", NORM: "ะธ ะดััะณะธะต"},
+ {ORTH: "ั.ะฝ.", NORM: "ัะฐะบ ะฝะฐะทัะฒะฐะตะผัะน"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
+
+
+for abbr in [
+ # Abbreviations used when addressing a person
+ {ORTH: "ะณ-ะฝ", NORM: "ะณะพัะฟะพะดะธะฝ"},
+ {ORTH: "ะณ-ะดะฐ", NORM: "ะณะพัะฟะพะดะฐ"},
+ {ORTH: "ะณ-ะถะฐ", NORM: "ะณะพัะฟะพะถะฐ"},
+ {ORTH: "ัะพะฒ.", NORM: "ัะพะฒะฐัะธั"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
+
+
+for abbr in [
+ # Abbreviations of time periods
+ {ORTH: "ะดะพ ะฝ.ั.", NORM: "ะดะพ ะฝะฐัะตะน ััั"},
+ {ORTH: "ะฟะพ ะฝ.ะฒ.", NORM: "ะฟะพ ะฝะฐััะพััะตะต ะฒัะตะผั"},
+ {ORTH: "ะฒ ะฝ.ะฒ.", NORM: "ะฒ ะฝะฐััะพััะตะต ะฒัะตะผั"},
+ {ORTH: "ะฝะฐัั.", NORM: "ะฝะฐััะพััะธะน"},
+ {ORTH: "ะฝะฐัั. ะฒัะตะผั", NORM: "ะฝะฐััะพััะตะต ะฒัะตะผั"},
+ {ORTH: "ะณ.ะณ.", NORM: "ะณะพะดั"},
+ {ORTH: "ะณะณ.", NORM: "ะณะพะดั"},
+ {ORTH: "ั.ะณ.", NORM: "ัะตะบััะธะน ะณะพะด"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
+
+
+for abbr in [
+ # Abbreviations of address-forming elements
+ {ORTH: "ัะตัะฟ.", NORM: "ัะตัะฟัะฑะปะธะบะฐ"},
+ {ORTH: "ะพะฑะป.", NORM: "ะพะฑะปะฐััั"},
+ {ORTH: "ะณ.ั.ะท.", NORM: "ะณะพัะพะด ัะตะดะตัะฐะปัะฝะพะณะพ ะทะฝะฐัะตะฝะธั"},
+ {ORTH: "ะฐ.ะพะฑะป.", NORM: "ะฐะฒัะพะฝะพะผะฝะฐั ะพะฑะปะฐััั"},
+ {ORTH: "ะฐ.ะพะบั.", NORM: "ะฐะฒัะพะฝะพะผะฝัะน ะพะบััะณ"},
+ {ORTH: "ะผ.ั-ะฝ", NORM: "ะผัะฝะธัะธะฟะฐะปัะฝัะน ัะฐะนะพะฝ"},
+ {ORTH: "ะณ.ะพ.", NORM: "ะณะพัะพะดัะบะพะน ะพะบััะณ"},
+ {ORTH: "ะณ.ะฟ.", NORM: "ะณะพัะพะดัะบะพะต ะฟะพัะตะปะตะฝะธะต"},
+ {ORTH: "ั.ะฟ.", NORM: "ัะตะปััะบะพะต ะฟะพัะตะปะตะฝะธะต"},
+ {ORTH: "ะฒะฝ.ั-ะฝ", NORM: "ะฒะฝัััะธะณะพัะพะดัะบะพะน ัะฐะนะพะฝ"},
+ {ORTH: "ะฒะฝ.ัะตั.ะณ.", NORM: "ะฒะฝัััะธะณะพัะพะดัะบะฐั ัะตััะธัะพัะธั ะณะพัะพะดะฐ"},
+ {ORTH: "ะฟะพั.", NORM: "ะฟะพัะตะปะตะฝะธะต"},
+ {ORTH: "ั-ะฝ", NORM: "ัะฐะนะพะฝ"},
+ {ORTH: "ั/ั", NORM: "ัะตะปััะพะฒะตั"},
+ {ORTH: "ะณ.", NORM: "ะณะพัะพะด"},
+ {ORTH: "ะฟ.ะณ.ั.", NORM: "ะฟะพัะตะปะพะบ ะณะพัะพะดัะบะพะณะพ ัะธะฟะฐ"},
+ {ORTH: "ะฟะณั.", NORM: "ะฟะพัะตะปะพะบ ะณะพัะพะดัะบะพะณะพ ัะธะฟะฐ"},
+ {ORTH: "ั.ะฟ.", NORM: "ัะฐะฑะพัะธะน ะฟะพัะตะปะพะบ"},
+ {ORTH: "ัะฟ.", NORM: "ัะฐะฑะพัะธะน ะฟะพัะตะปะพะบ"},
+ {ORTH: "ะบะฟ.", NORM: "ะบััะพััะฝัะน ะฟะพัะตะปะพะบ"},
+ {ORTH: "ะณะฟ.", NORM: "ะณะพัะพะดัะบะพะน ะฟะพัะตะปะพะบ"},
+ {ORTH: "ะฟ.", NORM: "ะฟะพัะตะปะพะบ"},
+ {ORTH: "ะฒ-ะบะธ", NORM: "ะฒััะตะปะบะธ"},
+ {ORTH: "ะณ-ะบ", NORM: "ะณะพัะพะดะพะบ"},
+ {ORTH: "ะท-ะบะฐ", NORM: "ะทะฐะธะผะบะฐ"},
+ {ORTH: "ะฟ-ะบ", NORM: "ะฟะพัะธะฝะพะบ"},
+ {ORTH: "ะบะธั.", NORM: "ะบะธัะปะฐะบ"},
+ {ORTH: "ะฟ. ัั.ย ", NORM: "ะฟะพัะตะปะพะบ ััะฐะฝัะธั"},
+ {ORTH: "ะฟ. ะถ/ะด ัั.ย ", NORM: "ะฟะพัะตะปะพะบ ะฟัะธ ะถะตะปะตะทะฝะพะดะพัะพะถะฝะพะน ััะฐะฝัะธะธ"},
+ {ORTH: "ะถ/ะด ะฑะป-ัั", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝัะน ะฑะปะพะบะฟะพัั"},
+ {ORTH: "ะถ/ะด ะฑ-ะบะฐ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ะฑัะดะบะฐ"},
+ {ORTH: "ะถ/ะด ะฒ-ะบะฐ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ะฒะตัะบะฐ"},
+ {ORTH: "ะถ/ะด ะบ-ะผะฐ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ะบะฐะทะฐัะผะฐ"},
+ {ORTH: "ะถ/ะด ะบ-ั", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝัะน ะบะพะผะฑะธะฝะฐั"},
+ {ORTH: "ะถ/ะด ะฟะป-ะผะฐ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ะฟะปะฐััะพัะผะฐ"},
+ {ORTH: "ะถ/ะด ะฟะป-ะบะฐ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ะฟะปะพัะฐะดะบะฐ"},
+ {ORTH: "ะถ/ะด ะฟ.ะฟ.", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝัะน ะฟััะตะฒะพะน ะฟะพัั"},
+ {ORTH: "ะถ/ะด ะพ.ะฟ.", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝัะน ะพััะฐะฝะพะฒะพัะฝัะน ะฟัะฝะบั"},
+ {ORTH: "ะถ/ะด ัะทะด.", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝัะน ัะฐะทัะตะทะด"},
+ {ORTH: "ะถ/ะด ัั.ย ", NORM: "ะถะตะปะตะทะฝะพะดะพัะพะถะฝะฐั ััะฐะฝัะธั"},
+ {ORTH: "ะผ-ะบะพ", NORM: "ะผะตััะตัะบะพ"},
+ {ORTH: "ะด.", NORM: "ะดะตัะตะฒะฝั"},
+ {ORTH: "ั.", NORM: "ัะตะปะพ"},
+ {ORTH: "ัะป.", NORM: "ัะปะพะฑะพะดะฐ"},
+ {ORTH: "ัั.ย ", NORM: "ััะฐะฝัะธั"},
+ {ORTH: "ัั-ัะฐ", NORM: "ััะฐะฝะธัะฐ"},
+ {ORTH: "ั.", NORM: "ัะปัั"},
+ {ORTH: "ั .", NORM: "ั ััะพั"},
+ {ORTH: "ัะทะด.", NORM: "ัะฐะทัะตะทะด"},
+ {ORTH: "ะทะธะผ.", NORM: "ะทะธะผะพะฒัะต"},
+ {ORTH: "ะฑ-ะณ", NORM: "ะฑะตัะตะณ"},
+ {ORTH: "ะถ/ั", NORM: "ะถะธะปะพะน ัะฐะนะพะฝ"},
+ {ORTH: "ะบะฒ-ะป", NORM: "ะบะฒะฐััะฐะป"},
+ {ORTH: "ะผะบั.", NORM: "ะผะธะบัะพัะฐะนะพะฝ"},
+ {ORTH: "ะพัั-ะฒ", NORM: "ะพัััะพะฒ"},
+ {ORTH: "ะฟะปะฐัั.", NORM: "ะฟะปะฐััะพัะผะฐ"},
+ {ORTH: "ะฟ/ั", NORM: "ะฟัะพะผััะปะตะฝะฝัะน ัะฐะนะพะฝ"},
+ {ORTH: "ั-ะฝ", NORM: "ัะฐะนะพะฝ"},
+ {ORTH: "ัะตั.", NORM: "ัะตััะธัะพัะธั"},
+ {
+ ORTH: "ัะตั. ะกะะ",
+ NORM: "ัะตััะธัะพัะธั ัะฐะดะพะฒะพะดัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ะพะฑัะตะดะธะฝะตะฝะธะน ะณัะฐะถะดะฐะฝ",
+ },
+ {
+ ORTH: "ัะตั. ะะะ",
+ NORM: "ัะตััะธัะพัะธั ะพะณะพัะพะดะฝะธัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ะพะฑัะตะดะธะฝะตะฝะธะน ะณัะฐะถะดะฐะฝ",
+ },
+ {ORTH: "ัะตั. ะะะ", NORM: "ัะตััะธัะพัะธั ะดะฐัะฝัั ะฝะตะบะพะผะผะตััะตัะบะธั ะพะฑัะตะดะธะฝะตะฝะธะน ะณัะฐะถะดะฐะฝ"},
+ {ORTH: "ัะตั. ะกะะข", NORM: "ัะตััะธัะพัะธั ัะฐะดะพะฒะพะดัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ัะพะฒะฐัะธัะตััะฒ"},
+ {ORTH: "ัะตั. ะะะข", NORM: "ัะตััะธัะพัะธั ะพะณะพัะพะดะฝะธัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ัะพะฒะฐัะธัะตััะฒ"},
+ {ORTH: "ัะตั. ะะะข", NORM: "ัะตััะธัะพัะธั ะดะฐัะฝัั ะฝะตะบะพะผะผะตััะตัะบะธั ัะพะฒะฐัะธัะตััะฒ"},
+ {ORTH: "ัะตั. ะกะะ", NORM: "ัะตััะธัะพัะธั ัะฐะดะพะฒะพะดัะตัะบะธั ะฟะพััะตะฑะธัะตะปััะบะธั ะบะพะพะฟะตัะฐัะธะฒะพะฒ"},
+ {ORTH: "ัะตั. ะะะ", NORM: "ัะตััะธัะพัะธั ะพะณะพัะพะดะฝะธัะตัะบะธั ะฟะพััะตะฑะธัะตะปััะบะธั ะบะพะพะฟะตัะฐัะธะฒะพะฒ"},
+ {ORTH: "ัะตั. ะะะ", NORM: "ัะตััะธัะพัะธั ะดะฐัะฝัั ะฟะพััะตะฑะธัะตะปััะบะธั ะบะพะพะฟะตัะฐัะธะฒะพะฒ"},
+ {ORTH: "ัะตั. ะกะะ", NORM: "ัะตััะธัะพัะธั ัะฐะดะพะฒะพะดัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ะฟะฐััะฝะตัััะฒ"},
+ {ORTH: "ัะตั. ะะะ", NORM: "ัะตััะธัะพัะธั ะพะณะพัะพะดะฝะธัะตัะบะธั ะฝะตะบะพะผะผะตััะตัะบะธั ะฟะฐััะฝะตัััะฒ"},
+ {ORTH: "ัะตั. ะะะ", NORM: "ัะตััะธัะพัะธั ะดะฐัะฝัั ะฝะตะบะพะผะผะตััะตัะบะธั ะฟะฐััะฝะตัััะฒ"},
+ {ORTH: "ัะตั. ะขะกะ", NORM: "ัะตััะธัะพัะธั ัะพะฒะฐัะธัะตััะฒะฐ ัะพะฑััะฒะตะฝะฝะธะบะพะฒ ะฝะตะดะฒะธะถะธะผะพััะธ"},
+ {ORTH: "ัะตั. ะะกะ", NORM: "ัะตััะธัะพัะธั ะณะฐัะฐะถะฝะพ-ัััะพะธัะตะปัะฝะพะณะพ ะบะพะพะฟะตัะฐัะธะฒะฐ"},
+ {ORTH: "ัั.", NORM: "ััะฐะดัะฑะฐ"},
+ {ORTH: "ัะตั.ั.ั .", NORM: "ัะตััะธัะพัะธั ัะตัะผะตััะบะพะณะพ ั ะพะทัะนััะฒะฐ"},
+ {ORTH: "ั.", NORM: "ัััั"},
+ {ORTH: "ะฐะป.", NORM: "ะฐะปะปะตั"},
+ {ORTH: "ะฑ-ั", NORM: "ะฑัะปัะฒะฐั"},
+ {ORTH: "ะฒะทะฒ.", NORM: "ะฒะทะฒะพะท"},
+ {ORTH: "ะฒะทะด.", NORM: "ะฒัะตะทะด"},
+ {ORTH: "ะดะพั.", NORM: "ะดะพัะพะณะฐ"},
+ {ORTH: "ะทะทะด.", NORM: "ะทะฐะตะทะด"},
+ {ORTH: "ะบะผ", NORM: "ะบะธะปะพะผะตัั"},
+ {ORTH: "ะบ-ัะพ", NORM: "ะบะพะปััะพ"},
+ {ORTH: "ะปะฝ.", NORM: "ะปะธะฝะธั"},
+ {ORTH: "ะผะณััั.", NORM: "ะผะฐะณะธัััะฐะปั"},
+ {ORTH: "ะฝะฐะฑ.", NORM: "ะฝะฐะฑะตัะตะถะฝะฐั"},
+ {ORTH: "ะฟะตั-ะด", NORM: "ะฟะตัะตะตะทะด"},
+ {ORTH: "ะฟะตั.", NORM: "ะฟะตัะตัะปะพะบ"},
+ {ORTH: "ะฟะป-ะบะฐ", NORM: "ะฟะปะพัะฐะดะบะฐ"},
+ {ORTH: "ะฟะป.", NORM: "ะฟะปะพัะฐะดั"},
+ {ORTH: "ะฟั-ะด", NORM: "ะฟัะพะตะทะด"},
+ {ORTH: "ะฟั-ะบ", NORM: "ะฟัะพัะตะบ"},
+ {ORTH: "ะฟั-ะบะฐ", NORM: "ะฟัะพัะตะบะฐ"},
+ {ORTH: "ะฟั-ะปะพะบ", NORM: "ะฟัะพัะตะปะพะบ"},
+ {ORTH: "ะฟั-ะบั", NORM: "ะฟัะพัะฟะตะบั"},
+ {ORTH: "ะฟัะพัะป.", NORM: "ะฟัะพัะปะพะบ"},
+ {ORTH: "ัะทะด.", NORM: "ัะฐะทัะตะทะด"},
+ {ORTH: "ััะด", NORM: "ััะด(ั)"},
+ {ORTH: "ั-ั", NORM: "ัะบะฒะตั"},
+ {ORTH: "ั-ะบ", NORM: "ัะฟััะบ"},
+ {ORTH: "ัะทะด.", NORM: "ััะตะทะด"},
+ {ORTH: "ััะฟ.", NORM: "ััะฟะธะบ"},
+ {ORTH: "ัะป.", NORM: "ัะปะธัะฐ"},
+ {ORTH: "ั.", NORM: "ัะพััะต"},
+ {ORTH: "ะฒะปะด.", NORM: "ะฒะปะฐะดะตะฝะธะต"},
+ {ORTH: "ะณ-ะถ", NORM: "ะณะฐัะฐะถ"},
+ {ORTH: "ะด.", NORM: "ะดะพะผ"},
+ {ORTH: "ะดะฒะปะด.", NORM: "ะดะพะผะพะฒะปะฐะดะตะฝะธะต"},
+ {ORTH: "ะทะด.", NORM: "ะทะดะฐะฝะธะต"},
+ {ORTH: "ะท/ั", NORM: "ะทะตะผะตะปัะฝัะน ััะฐััะพะบ"},
+ {ORTH: "ะบะฒ.", NORM: "ะบะฒะฐััะธัะฐ"},
+ {ORTH: "ะบะพะผ.", NORM: "ะบะพะผะฝะฐัะฐ"},
+ {ORTH: "ะฟะพะดะฒ.", NORM: "ะฟะพะดะฒะฐะป"},
+ {ORTH: "ะบะพั.", NORM: "ะบะพัะตะปัะฝะฐั"},
+ {ORTH: "ะฟ-ะฑ", NORM: "ะฟะพะณัะตะฑ"},
+ {ORTH: "ะบ.", NORM: "ะบะพัะฟัั"},
+ {ORTH: "ะะะก", NORM: "ะพะฑัะตะบั ะฝะตะทะฐะฒะตััะตะฝะฝะพะณะพ ัััะพะธัะตะปัััะฒะฐ"},
+ {ORTH: "ะพั.", NORM: "ะพัะธั"},
+ {ORTH: "ะฟะฐะฒ.", NORM: "ะฟะฐะฒะธะปัะพะฝ"},
+ {ORTH: "ะฟะพะผะตั.", NORM: "ะฟะพะผะตัะตะฝะธะต"},
+ {ORTH: "ัะฐะฑ.ัั.", NORM: "ัะฐะฑะพัะธะน ััะฐััะพะบ"},
+ {ORTH: "ัะบะป.", NORM: "ัะบะปะฐะด"},
+ {ORTH: "coop.", NORM: "ัะพะพััะถะตะฝะธะต"},
+ {ORTH: "ััั.", NORM: "ัััะพะตะฝะธะต"},
+ {ORTH: "ัะพัะณ.ะทะฐะป", NORM: "ัะพัะณะพะฒัะน ะทะฐะป"},
+ {ORTH: "ะฐ/ะฟ", NORM: "ะฐััะพะฟะพัั"},
+ {ORTH: "ะธะผ.", NORM: "ะธะผะตะฝะธ"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
+
+
+for abbr in [
+ # Other abbreviations
+ {ORTH: "ััั.ััะฑ.", NORM: "ััััั ััะฑะปะตะน"},
+ {ORTH: "ััั.", NORM: "ััััั"},
+ {ORTH: "ััะฑ.", NORM: "ััะฑะปั"},
+ {ORTH: "ะดะพะปะป.", NORM: "ะดะพะปะปะฐั"},
+ {ORTH: "ะฟัะธะผ.", NORM: "ะฟัะธะผะตัะฐะฝะธะต"},
+ {ORTH: "ะฟัะธะผ.ัะตะด.", NORM: "ะฟัะธะผะตัะฐะฝะธะต ัะตะดะฐะบัะธะธ"},
+ {ORTH: "ัะผ. ัะฐะบะถะต", NORM: "ัะผะพััะธ ัะฐะบะถะต"},
+ {ORTH: "ะบะฒ.ะผ.", NORM: "ะบะฒะฐะดัะฐะฝัะฝัะน ะผะตัั"},
+ {ORTH: "ะผ2", NORM: "ะบะฒะฐะดัะฐะฝัะฝัะน ะผะตัั"},
+ {ORTH: "ะฑ/ั", NORM: "ะฑัะฒัะธะน ะฒ ัะฟะพััะตะฑะปะตะฝะธะธ"},
+ {ORTH: "ัะพะบั.", NORM: "ัะพะบัะฐัะตะฝะธะต"},
+ {ORTH: "ัะตะป.", NORM: "ัะตะปะพะฒะตะบ"},
+ {ORTH: "ะฑ.ะฟ.", NORM: "ะฑะฐะทะธัะฝัะน ะฟัะฝะบั"},
+]:
+ _exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
diff --git a/spacy/lang/sl/examples.py b/spacy/lang/sl/examples.py
new file mode 100644
index 000000000..bf483c6a4
--- /dev/null
+++ b/spacy/lang/sl/examples.py
@@ -0,0 +1,18 @@
+"""
+Example sentences to test spaCy and its language models.
+
+>>> from spacy.lang.sl.examples import sentences
+>>> docs = nlp.pipe(sentences)
+"""
+
+
+sentences = [
+ "Apple naฤrtuje nakup britanskega startupa za 1 bilijon dolarjev",
+ "France Preลกeren je umrl 8. februarja 1849 v Kranju",
+ "Staro ljubljansko letaliลกฤe Moste bo obnovila druลพba BTC",
+ "London je najveฤje mesto v Zdruลพenem kraljestvu.",
+ "Kje se skrivaลก?",
+ "Kdo je predsednik Francije?",
+ "Katero je glavno mesto Zdruลพenih drลพav Amerike?",
+ "Kdaj je bil rojen Milan Kuฤan?",
+]
diff --git a/spacy/lang/tr/lex_attrs.py b/spacy/lang/tr/lex_attrs.py
index f7416837d..6d9f4f388 100644
--- a/spacy/lang/tr/lex_attrs.py
+++ b/spacy/lang/tr/lex_attrs.py
@@ -53,7 +53,7 @@ _ordinal_words = [
"doksanฤฑncฤฑ",
"yรผzรผncรผ",
"bininci",
- "mliyonuncu",
+ "milyonuncu",
"milyarฤฑncฤฑ",
"trilyonuncu",
"katrilyonuncu",
diff --git a/spacy/lang/vi/lex_attrs.py b/spacy/lang/vi/lex_attrs.py
index 33a3745cc..0cbda4ffb 100644
--- a/spacy/lang/vi/lex_attrs.py
+++ b/spacy/lang/vi/lex_attrs.py
@@ -2,22 +2,29 @@ from ...attrs import LIKE_NUM
_num_words = [
- "khรดng",
- "mแปt",
- "hai",
- "ba",
- "bแปn",
- "nฤm",
- "sรกu",
- "bแบฃy",
- "bแบฉy",
- "tรกm",
- "chรญn",
- "mฦฐแปi",
- "chแปฅc",
- "trฤm",
- "nghรฌn",
- "tแปท",
+ "khรดng", # Zero
+ "mแปt", # One
+ "mแปt", # Also one, irreplacable in nichรฉ cases for unit digit such as "51"="nฤm mฦฐฦกi mแปt"
+ "hai", # Two
+ "ba", # Three
+ "bแปn", # Four
+ "tฦฐ", # Also four, used in certain cases for unit digit such as "54"="nฤm mฦฐฦกi tฦฐ"
+ "nฤm", # Five
+ "lฤm", # Also five, irreplacable in nichรฉ cases for unit digit such as "55"="nฤm mฦฐฦกi lฤm"
+ "sรกu", # Six
+ "bแบฃy", # Seven
+ "bแบฉy", # Also seven, old fashioned
+ "tรกm", # Eight
+ "chรญn", # Nine
+ "mฦฐแปi", # Ten
+ "chแปฅc", # Also ten, used for counting in tens such as "20 eggs"="hai chแปฅc trแปฉng"
+ "trฤm", # Hundred
+ "nghรฌn", # Thousand
+ "ngร n", # Also thousand, used in the south
+ "vแบกn", # Ten thousand
+ "triแปu", # Million
+ "tแปท", # Billion
+ "tแป", # Also billion, used in combinatorics such as "tแป_phรบ"="billionaire"
]
diff --git a/spacy/language.py b/spacy/language.py
index fdce34ac4..bab403f0e 100644
--- a/spacy/language.py
+++ b/spacy/language.py
@@ -131,7 +131,7 @@ class Language:
self,
vocab: Union[Vocab, bool] = True,
*,
- max_length: int = 10 ** 6,
+ max_length: int = 10**6,
meta: Dict[str, Any] = {},
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
batch_size: int = 1000,
@@ -1222,8 +1222,9 @@ class Language:
component_cfg = {}
grads = {}
- def get_grads(W, dW, key=None):
+ def get_grads(key, W, dW):
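+ # stand-in for the optimizer: store each parameter's gradient under its key; the real update is applied below, once per key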
grads[key] = (W, dW)
+ return W, dW
get_grads.learn_rate = sgd.learn_rate # type: ignore[attr-defined, union-attr]
get_grads.b1 = sgd.b1 # type: ignore[attr-defined, union-attr]
@@ -1236,7 +1237,7 @@ class Language:
examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {})
)
for key, (W, dW) in grads.items():
- sgd(W, dW, key=key) # type: ignore[call-arg, misc]
+ sgd(key, W, dW) # type: ignore[call-arg, misc]
return losses
def begin_training(
diff --git a/spacy/matcher/matcher.pyx b/spacy/matcher/matcher.pyx
index 6aa58f0e3..e43583e30 100644
--- a/spacy/matcher/matcher.pyx
+++ b/spacy/matcher/matcher.pyx
@@ -244,8 +244,12 @@ cdef class Matcher:
pipe = "parser"
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
raise ValueError(error_msg)
- matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
- extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
+
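+ # taking &self.patterns[0] on an empty vector is undefined behaviour, so skip matching entirely when no patterns have been added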
+ if self.patterns.empty():
+ matches = []
+ else:
+ matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
+ extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
final_matches = []
pairs_by_id = {}
# For each key, either add all matches, or only the filtered,
@@ -686,18 +690,14 @@ cdef int8_t get_is_match(PatternStateC state,
return True
-cdef int8_t get_is_final(PatternStateC state) nogil:
+cdef inline int8_t get_is_final(PatternStateC state) nogil:
if state.pattern[1].quantifier == FINAL_ID:
- id_attr = state.pattern[1].attrs[0]
- if id_attr.attr != ID:
- with gil:
- raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return 1
else:
return 0
-cdef int8_t get_quantifier(PatternStateC state) nogil:
+cdef inline int8_t get_quantifier(PatternStateC state) nogil:
return state.pattern.quantifier
diff --git a/spacy/matcher/phrasematcher.pyi b/spacy/matcher/phrasematcher.pyi
index 82a194835..68e3386e4 100644
--- a/spacy/matcher/phrasematcher.pyi
+++ b/spacy/matcher/phrasematcher.pyi
@@ -14,7 +14,7 @@ class PhraseMatcher:
def add(
self,
key: str,
- docs: List[List[Dict[str, Any]]],
+ docs: List[Doc],
*,
on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
diff --git a/spacy/ml/extract_spans.py b/spacy/ml/extract_spans.py
index edc86ff9c..d5e9bc07c 100644
--- a/spacy/ml/extract_spans.py
+++ b/spacy/ml/extract_spans.py
@@ -63,4 +63,4 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
- return (Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths))
+ return Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths)
diff --git a/spacy/ml/models/entity_linker.py b/spacy/ml/models/entity_linker.py
index 831fee90f..0149bea89 100644
--- a/spacy/ml/models/entity_linker.py
+++ b/spacy/ml/models/entity_linker.py
@@ -1,34 +1,82 @@
from pathlib import Path
-from typing import Optional, Callable, Iterable, List
+from typing import Optional, Callable, Iterable, List, Tuple
from thinc.types import Floats2d
from thinc.api import chain, clone, list2ragged, reduce_mean, residual
-from thinc.api import Model, Maxout, Linear
+from thinc.api import Model, Maxout, Linear, noop, tuplify, Ragged
from ...util import registry
from ...kb import KnowledgeBase, Candidate, get_candidates
from ...vocab import Vocab
from ...tokens import Span, Doc
+from ..extract_spans import extract_spans
+from ...errors import Errors
-@registry.architectures("spacy.EntityLinker.v1")
+@registry.architectures("spacy.EntityLinker.v2")
def build_nel_encoder(
tok2vec: Model, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]:
- with Model.define_operators({">>": chain, "**": clone}):
+ with Model.define_operators({">>": chain, "&": tuplify}):
token_width = tok2vec.maybe_get_dim("nO")
output_layer = Linear(nO=nO, nI=token_width)
model = (
- tok2vec
- >> list2ragged()
+ ((tok2vec >> list2ragged()) & build_span_maker())
+ >> extract_spans()
>> reduce_mean()
>> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) # type: ignore[arg-type]
>> output_layer
)
model.set_ref("output_layer", output_layer)
model.set_ref("tok2vec", tok2vec)
+ # flag to show this isn't legacy
+ model.attrs["include_span_maker"] = True
return model
+def build_span_maker(n_sents: int = 0) -> Model:
+ model: Model = Model("span_maker", forward=span_maker_forward)
+ model.attrs["n_sents"] = n_sents
+ return model
+
+
+def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callable]:
+ ops = model.ops
+ n_sents = model.attrs["n_sents"]
+ candidates = []
+ for doc in docs:
+ cands = []
+ try:
+ sentences = [s for s in doc.sents]
+ except ValueError:
+ # no sentence info, normal in initialization
+ for tok in doc:
+ tok.is_sent_start = tok.i == 0
+ sentences = [doc[:]]
+ for ent in doc.ents:
+ try:
+ # find the sentence in the list of sentences.
+ sent_index = sentences.index(ent.sent)
+ except AttributeError:
+ # Catch the exception when ent.sent is None and provide a user-friendly warning
+ raise RuntimeError(Errors.E030) from None
+ # get n previous sentences, if there are any
+ start_sentence = max(0, sent_index - n_sents)
+ # get the n sentences that follow, or as many as there are if fewer than n
+ end_sentence = min(len(sentences) - 1, sent_index + n_sents)
+ # get token positions
+ start_token = sentences[start_sentence].start
+ end_token = sentences[end_sentence].end
+ # save positions for extraction
+ cands.append((start_token, end_token))
+
+ candidates.append(ops.asarray2i(cands))
+ candlens = ops.asarray1i([len(cands) for cands in candidates])
+ candidates = ops.xp.concatenate(candidates)
+ outputs = Ragged(candidates, candlens)
+ # because this is just rearranging docs, the backprop does nothing
+ return outputs, lambda x: []
+
+
@registry.misc("spacy.KBFromFile.v1")
def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
def kb_from_file(vocab):
diff --git a/spacy/ml/models/multi_task.py b/spacy/ml/models/multi_task.py
index 9e1face63..a7d67c6dd 100644
--- a/spacy/ml/models/multi_task.py
+++ b/spacy/ml/models/multi_task.py
@@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
target = target.reshape((-1, 256 * nr_char))
diff = prediction - target
- loss = (diff ** 2).sum()
+ loss = (diff**2).sum()
d_target = diff / float(prediction.shape[0])
return loss, d_target
diff --git a/spacy/ml/models/tagger.py b/spacy/ml/models/tagger.py
index 9c7fe042d..9f8ef7b2b 100644
--- a/spacy/ml/models/tagger.py
+++ b/spacy/ml/models/tagger.py
@@ -1,14 +1,14 @@
from typing import Optional, List
-from thinc.api import zero_init, with_array, Softmax, chain, Model
+from thinc.api import zero_init, with_array, Softmax_v2, chain, Model
from thinc.types import Floats2d
from ...util import registry
from ...tokens import Doc
-@registry.architectures("spacy.Tagger.v1")
+@registry.architectures("spacy.Tagger.v2")
def build_tagger_model(
- tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
+ tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None, normalize=False
) -> Model[List[Doc], List[Floats2d]]:
"""Build a tagger model, using a provided token-to-vector component. The tagger
model simply adds a linear layer with softmax activation to predict scores
@@ -19,7 +19,9 @@ def build_tagger_model(
"""
# TODO: glorot_uniform_init seems to work a bit better than zero_init here?!
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
- output_layer = Softmax(nO, t2v_width, init_W=zero_init)
+ output_layer = Softmax_v2(
+ nO, t2v_width, init_W=zero_init, normalize_outputs=normalize
+ )
softmax = with_array(output_layer) # type: ignore
model = chain(tok2vec, softmax)
model.set_ref("tok2vec", tok2vec)
diff --git a/spacy/ml/parser_model.pyx b/spacy/ml/parser_model.pyx
index da937ca4f..4e854178d 100644
--- a/spacy/ml/parser_model.pyx
+++ b/spacy/ml/parser_model.pyx
@@ -11,6 +11,7 @@ import numpy.random
from thinc.api import Model, CupyOps, NumpyOps
from .. import util
+from ..errors import Errors
from ..typedefs cimport weight_t, class_t, hash_t
from ..pipeline._parser_internals.stateclass cimport StateClass
@@ -411,7 +412,7 @@ cdef class precompute_hiddens:
elif name == "nO":
return self.nO
else:
- raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP")
+ raise ValueError(Errors.E1033.format(name=name))
def set_dim(self, name, value):
if name == "nF":
@@ -421,7 +422,7 @@ cdef class precompute_hiddens:
elif name == "nO":
self.nO = value
else:
- raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP")
+ raise ValueError(Errors.E1033.format(name=name))
def __call__(self, X, bint is_train):
if is_train:
diff --git a/spacy/pipeline/__init__.py b/spacy/pipeline/__init__.py
index 7b483724c..938ab08c6 100644
--- a/spacy/pipeline/__init__.py
+++ b/spacy/pipeline/__init__.py
@@ -1,5 +1,6 @@
from .attributeruler import AttributeRuler
from .dep_parser import DependencyParser
+from .edit_tree_lemmatizer import EditTreeLemmatizer
from .entity_linker import EntityLinker
from .ner import EntityRecognizer
from .entityruler import EntityRuler
diff --git a/spacy/pipeline/_edit_tree_internals/__init__.py b/spacy/pipeline/_edit_tree_internals/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/spacy/pipeline/_edit_tree_internals/edit_trees.pxd b/spacy/pipeline/_edit_tree_internals/edit_trees.pxd
new file mode 100644
index 000000000..dc4289f37
--- /dev/null
+++ b/spacy/pipeline/_edit_tree_internals/edit_trees.pxd
@@ -0,0 +1,93 @@
+from libc.stdint cimport uint32_t, uint64_t
+from libcpp.unordered_map cimport unordered_map
+from libcpp.vector cimport vector
+
+from ...typedefs cimport attr_t, hash_t, len_t
+from ...strings cimport StringStore
+
+cdef extern from "" namespace "std" nogil:
+ void swap[T](T& a, T& b) except + # Only available in Cython 3.
+
+# An edit tree (Müller et al., 2015) is a tree structure that consists of
+# edit operations. The two types of operations are string matches
+# and string substitutions. Given an input string s and an output string t,
+# substitution and match nodes should be interpreted as follows:
+#
+# * Substitution node: consists of an original string and substitute string.
+# If s matches the original string, then t is the substitute. Otherwise,
+# the node does not apply.
+# * Match node: consists of a prefix length, suffix length, prefix edit tree,
+# and suffix edit tree. If s is composed of a prefix, middle part, and suffix
+# with the given suffix and prefix lengths, then t is the concatenation
+# prefix_tree(prefix) + middle + suffix_tree(suffix).
+#
+# For efficiency, we represent strings in substitution nodes as integers, with
+# the actual strings stored in a StringStore. Subtrees in match nodes are stored
+# as tree identifiers (rather than pointers) to simplify serialization.
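+#
+# As a concrete sketch: for the pair form="gegooid", lemma="gooien" the shared
+# substring is "gooi", so the top-level tree is a match node with prefix_len=2
+# and suffix_len=1, a prefix subtree that substitutes "ge" -> "", and a suffix
+# subtree that substitutes "d" -> "en".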
+
+cdef uint32_t NULL_TREE_ID
+
+cdef struct MatchNodeC:
+ len_t prefix_len
+ len_t suffix_len
+ uint32_t prefix_tree
+ uint32_t suffix_tree
+
+cdef struct SubstNodeC:
+ attr_t orig
+ attr_t subst
+
+cdef union NodeC:
+ MatchNodeC match_node
+ SubstNodeC subst_node
+
+cdef struct EditTreeC:
+ bint is_match_node
+ NodeC inner
+
+cdef inline EditTreeC edittree_new_match(len_t prefix_len, len_t suffix_len,
+ uint32_t prefix_tree, uint32_t suffix_tree):
+ cdef MatchNodeC match_node = MatchNodeC(prefix_len=prefix_len,
+ suffix_len=suffix_len, prefix_tree=prefix_tree,
+ suffix_tree=suffix_tree)
+ cdef NodeC inner = NodeC(match_node=match_node)
+ return EditTreeC(is_match_node=True, inner=inner)
+
+cdef inline EditTreeC edittree_new_subst(attr_t orig, attr_t subst):
+ cdef EditTreeC node
+ cdef SubstNodeC subst_node = SubstNodeC(orig=orig, subst=subst)
+ cdef NodeC inner = NodeC(subst_node=subst_node)
+ return EditTreeC(is_match_node=False, inner=inner)
+
+cdef inline uint64_t edittree_hash(EditTreeC tree):
+ cdef MatchNodeC match_node
+ cdef SubstNodeC subst_node
+
+ if tree.is_match_node:
+ match_node = tree.inner.match_node
+ return hash((match_node.prefix_len, match_node.suffix_len, match_node.prefix_tree, match_node.suffix_tree))
+ else:
+ subst_node = tree.inner.subst_node
+ return hash((subst_node.orig, subst_node.subst))
+
+cdef struct LCS:
+ int source_begin
+ int source_end
+ int target_begin
+ int target_end
+
+cdef inline bint lcs_is_empty(LCS lcs):
+ return lcs.source_begin == 0 and lcs.source_end == 0 and lcs.target_begin == 0 and lcs.target_end == 0
+
+cdef class EditTrees:
+ cdef vector[EditTreeC] trees
+ cdef unordered_map[hash_t, uint32_t] map
+ cdef StringStore strings
+
+ cpdef uint32_t add(self, str form, str lemma)
+ cpdef str apply(self, uint32_t tree_id, str form)
+ cpdef unicode tree_to_str(self, uint32_t tree_id)
+
+ cdef uint32_t _add(self, str form, str lemma)
+ cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces)
+ cdef uint32_t _tree_id(self, EditTreeC tree)
diff --git a/spacy/pipeline/_edit_tree_internals/edit_trees.pyx b/spacy/pipeline/_edit_tree_internals/edit_trees.pyx
new file mode 100644
index 000000000..9d18c0334
--- /dev/null
+++ b/spacy/pipeline/_edit_tree_internals/edit_trees.pyx
@@ -0,0 +1,305 @@
+# cython: infer_types=True, binding=True
+from cython.operator cimport dereference as deref
+from libc.stdint cimport uint32_t
+from libc.stdint cimport UINT32_MAX
+from libc.string cimport memset
+from libcpp.pair cimport pair
+from libcpp.vector cimport vector
+
+from pathlib import Path
+
+from ...typedefs cimport hash_t
+
+from ... import util
+from ...errors import Errors
+from ...strings import StringStore
+from .schemas import validate_edit_tree
+
+
+NULL_TREE_ID = UINT32_MAX
+
+cdef LCS find_lcs(str source, str target):
+ """
+ Find the longest common substring (LCS) between two strings. If there are
+ multiple LCSes, only one of them is returned.
+
+ source (str): The first string.
+ target (str): The second string.
+ RETURNS (LCS): The spans of the longest common substring.
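+
+ For example, for source="gegooid" and target="gooien" the longest common
+ substring is "gooi", i.e. the source span [2, 6) and the target span [0, 4).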
+ """
+ cdef Py_ssize_t source_len = len(source)
+ cdef Py_ssize_t target_len = len(target)
+ cdef size_t longest_align = 0;
+ cdef int source_idx, target_idx
+ cdef LCS lcs
+ cdef Py_UCS4 source_cp, target_cp
+
+ memset(&lcs, 0, sizeof(lcs))
+
+ cdef vector[size_t] prev_aligns = vector[size_t](target_len);
+ cdef vector[size_t] cur_aligns = vector[size_t](target_len);
+
+ for (source_idx, source_cp) in enumerate(source):
+ for (target_idx, target_cp) in enumerate(target):
+ if source_cp == target_cp:
+ if source_idx == 0 or target_idx == 0:
+ cur_aligns[target_idx] = 1
+ else:
+ cur_aligns[target_idx] = prev_aligns[target_idx - 1] + 1
+
+ # Check if this is the longest alignment and replace previous
+ # best alignment when this is the case.
+ if cur_aligns[target_idx] > longest_align:
+ longest_align = cur_aligns[target_idx]
+ lcs.source_begin = source_idx - longest_align + 1
+ lcs.source_end = source_idx + 1
+ lcs.target_begin = target_idx - longest_align + 1
+ lcs.target_end = target_idx + 1
+ else:
+ # No match, we start with a zero-length alignment.
+ cur_aligns[target_idx] = 0
+ swap(prev_aligns, cur_aligns)
+
+ return lcs
+
+cdef class EditTrees:
+ """Container for constructing and storing edit trees."""
+ def __init__(self, strings: StringStore):
+ """Create a container for edit trees.
+
+ strings (StringStore): the string store to use."""
+ self.strings = strings
+
+ cpdef uint32_t add(self, str form, str lemma):
+ """Add an edit tree that rewrites the given string into the given lemma.
+
+ RETURNS (int): identifier of the edit tree in the container.
+ """
+ # Treat two empty strings as a special case. Generating an edit
+ # tree for identical strings results in a match node. However,
+ # since two empty strings have a zero-length LCS, a substitution
+ # node would be created. Since we do not want to clutter the
+ # recursive tree construction with logic for this case, handle
+ # it in this wrapper method.
+ if len(form) == 0 and len(lemma) == 0:
+ tree = edittree_new_match(0, 0, NULL_TREE_ID, NULL_TREE_ID)
+ return self._tree_id(tree)
+
+ return self._add(form, lemma)
+
+ cdef uint32_t _add(self, str form, str lemma):
+ cdef LCS lcs = find_lcs(form, lemma)
+
+ cdef EditTreeC tree
+ cdef uint32_t tree_id, prefix_tree, suffix_tree
+ if lcs_is_empty(lcs):
+ tree = edittree_new_subst(self.strings.add(form), self.strings.add(lemma))
+ else:
+ # If we have a non-empty LCS, such as "gooi" in "ge[gooi]d" and "[gooi]en",
+ # create edit trees for the prefix pair ("ge"/"") and the suffix pair ("d"/"en").
+ prefix_tree = NULL_TREE_ID
+ if lcs.source_begin != 0 or lcs.target_begin != 0:
+ prefix_tree = self.add(form[:lcs.source_begin], lemma[:lcs.target_begin])
+
+ suffix_tree = NULL_TREE_ID
+ if lcs.source_end != len(form) or lcs.target_end != len(lemma):
+ suffix_tree = self.add(form[lcs.source_end:], lemma[lcs.target_end:])
+
+ tree = edittree_new_match(lcs.source_begin, len(form) - lcs.source_end, prefix_tree, suffix_tree)
+
+ return self._tree_id(tree)
+
+ cdef uint32_t _tree_id(self, EditTreeC tree):
+ # If this tree has been constructed before, return its identifier.
+ cdef hash_t hash = edittree_hash(tree)
+ cdef unordered_map[hash_t, uint32_t].iterator iter = self.map.find(hash)
+ if iter != self.map.end():
+ return deref(iter).second
+
+ # The tree hasn't been seen before, store it.
+ cdef uint32_t tree_id = self.trees.size()
+ self.trees.push_back(tree)
+ self.map.insert(pair[hash_t, uint32_t](hash, tree_id))
+
+ return tree_id
+
+ cpdef str apply(self, uint32_t tree_id, str form):
+ """Apply an edit tree to a form.
+
+ tree_id (uint32_t): the identifier of the edit tree to apply.
+ form (str): the form to apply the edit tree to.
+ RETURNS (str): the transformed form or None if the edit tree
+ could not be applied to the form.
+ """
+ if tree_id >= self.trees.size():
+ raise IndexError(Errors.E1030)
+
+ lemma_pieces = []
+ try:
+ self._apply(tree_id, form, lemma_pieces)
+ except ValueError:
+ return None
+ return "".join(lemma_pieces)
+
+ cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces):
+ """Recursively apply an edit tree to a form, adding pieces to
+ the lemma_pieces list."""
+ assert tree_id < self.trees.size()
+
+ cdef EditTreeC tree = self.trees[tree_id]
+ cdef MatchNodeC match_node
+ cdef int suffix_start
+
+ if tree.is_match_node:
+ match_node = tree.inner.match_node
+
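+ # the form is too short to contain this tree's prefix and suffix, so the tree does not apply to it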
+ if match_node.prefix_len + match_node.suffix_len > len(form_part):
+ raise ValueError(Errors.E1029)
+
+ suffix_start = len(form_part) - match_node.suffix_len
+
+ if match_node.prefix_tree != NULL_TREE_ID:
+ self._apply(match_node.prefix_tree, form_part[:match_node.prefix_len], lemma_pieces)
+
+ lemma_pieces.append(form_part[match_node.prefix_len:suffix_start])
+
+ if match_node.suffix_tree != NULL_TREE_ID:
+ self._apply(match_node.suffix_tree, form_part[suffix_start:], lemma_pieces)
+ else:
+ if form_part == self.strings[tree.inner.subst_node.orig]:
+ lemma_pieces.append(self.strings[tree.inner.subst_node.subst])
+ else:
+ raise ValueError(Errors.E1029)
+
+ cpdef unicode tree_to_str(self, uint32_t tree_id):
+ """Return the tree as a string. The tree tree string is formatted
+ like an S-expression. This is primarily useful for debugging. Match
+ nodes have the following format:
+
+ (m prefix_len suffix_len prefix_tree suffix_tree)
+
+ Substitution nodes have the following format:
+
+ (s original substitute)
+
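+ For example, the tree built for the pair ("gegooid", "gooien") renders as
+ "(m 2 1 (s 'ge' '') (s 'd' 'en'))".
+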
+ tree_id (uint32_t): the identifier of the edit tree.
+ RETURNS (str): the tree as an S-expression.
+ """
+
+ if tree_id >= self.trees.size():
+ raise IndexError(Errors.E1030)
+
+ cdef EditTreeC tree = self.trees[tree_id]
+ cdef SubstNodeC subst_node
+
+ if not tree.is_match_node:
+ subst_node = tree.inner.subst_node
+ return f"(s '{self.strings[subst_node.orig]}' '{self.strings[subst_node.subst]}')"
+
+ cdef MatchNodeC match_node = tree.inner.match_node
+
+ prefix_tree = "()"
+ if match_node.prefix_tree != NULL_TREE_ID:
+ prefix_tree = self.tree_to_str(match_node.prefix_tree)
+
+ suffix_tree = "()"
+ if match_node.suffix_tree != NULL_TREE_ID:
+ suffix_tree = self.tree_to_str(match_node.suffix_tree)
+
+ return f"(m {match_node.prefix_len} {match_node.suffix_len} {prefix_tree} {suffix_tree})"
+
+ def from_json(self, trees: list) -> "EditTrees":
+ self.trees.clear()
+
+ for tree in trees:
+ tree = _dict2tree(tree)
+ self.trees.push_back(tree)
+
+ self._rebuild_tree_map()
+
+ return self
+
+ def from_bytes(self, bytes_data: bytes) -> "EditTrees":
+ def deserialize_trees(tree_dicts):
+ cdef EditTreeC c_tree
+ for tree_dict in tree_dicts:
+ c_tree = _dict2tree(tree_dict)
+ self.trees.push_back(c_tree)
+
+ deserializers = {}
+ deserializers["trees"] = lambda n: deserialize_trees(n)
+ util.from_bytes(bytes_data, deserializers, [])
+
+ self._rebuild_tree_map()
+
+ return self
+
+ def to_bytes(self, **kwargs) -> bytes:
+ tree_dicts = []
+ for tree in self.trees:
+ tree = _tree2dict(tree)
+ tree_dicts.append(tree)
+
+ serializers = {}
+ serializers["trees"] = lambda: tree_dicts
+
+ return util.to_bytes(serializers, [])
+
+ def to_disk(self, path, **kwargs) -> "EditTrees":
+ path = util.ensure_path(path)
+ with path.open("wb") as file_:
+ file_.write(self.to_bytes())
+
+ def from_disk(self, path, **kwargs) -> "EditTrees":
+ path = util.ensure_path(path)
+ if path.exists():
+ with path.open("rb") as file_:
+ data = file_.read()
+ return self.from_bytes(data)
+
+ return self
+
+ def __getitem__(self, idx):
+ return _tree2dict(self.trees[idx])
+
+ def __len__(self):
+ return self.trees.size()
+
+ def _rebuild_tree_map(self):
+ """Rebuild the tree hash -> tree id mapping"""
+ cdef EditTreeC c_tree
+ cdef uint32_t tree_id
+ cdef hash_t tree_hash
+
+ self.map.clear()
+
+ for tree_id in range(self.trees.size()):
+ c_tree = self.trees[tree_id]
+ tree_hash = edittree_hash(c_tree)
+ self.map.insert(pair[hash_t, uint32_t](tree_hash, tree_id))
+
+ def __reduce__(self):
+ return (unpickle_edittrees, (self.strings, self.to_bytes()))
+
+
+def unpickle_edittrees(strings, trees_data):
+ return EditTrees(strings).from_bytes(trees_data)
+
+
+def _tree2dict(tree):
+ if tree["is_match_node"]:
+ tree = tree["inner"]["match_node"]
+ else:
+ tree = tree["inner"]["subst_node"]
+ return dict(tree)
+
+def _dict2tree(tree):
+ errors = validate_edit_tree(tree)
+ if errors:
+ raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
+
+ tree = dict(tree)
+ if "prefix_len" in tree:
+ tree = {"is_match_node": True, "inner": {"match_node": tree}}
+ else:
+ tree = {"is_match_node": False, "inner": {"subst_node": tree}}
+
+ return tree
diff --git a/spacy/pipeline/_edit_tree_internals/schemas.py b/spacy/pipeline/_edit_tree_internals/schemas.py
new file mode 100644
index 000000000..c01d0632e
--- /dev/null
+++ b/spacy/pipeline/_edit_tree_internals/schemas.py
@@ -0,0 +1,44 @@
+from typing import Any, Dict, List, Union
+from collections import defaultdict
+from pydantic import BaseModel, Field, ValidationError
+from pydantic.types import StrictBool, StrictInt, StrictStr
+
+
+class MatchNodeSchema(BaseModel):
+ prefix_len: StrictInt = Field(..., title="Prefix length")
+ suffix_len: StrictInt = Field(..., title="Suffix length")
+ prefix_tree: StrictInt = Field(..., title="Prefix tree")
+ suffix_tree: StrictInt = Field(..., title="Suffix tree")
+
+ class Config:
+ extra = "forbid"
+
+
+class SubstNodeSchema(BaseModel):
+ orig: Union[int, StrictStr] = Field(..., title="Original substring")
+ subst: Union[int, StrictStr] = Field(..., title="Replacement substring")
+
+ class Config:
+ extra = "forbid"
+
+
+class EditTreeSchema(BaseModel):
+ __root__: Union[MatchNodeSchema, SubstNodeSchema]
+
+
+def validate_edit_tree(obj: Dict[str, Any]) -> List[str]:
+ """Validate edit tree.
+
+ obj (Dict[str, Any]): JSON-serializable data to validate.
+ RETURNS (List[str]): A list of error messages, if available.
+ """
+ try:
+ EditTreeSchema.parse_obj(obj)
+ return []
+ except ValidationError as e:
+ errors = e.errors()
+ data = defaultdict(list)
+ for error in errors:
+ err_loc = " -> ".join([str(p) for p in error.get("loc", [])])
+ data[err_loc].append(error.get("msg"))
+ return [f"[{loc}] {', '.join(msg)}" for loc, msg in data.items()] # type: ignore[arg-type]
diff --git a/spacy/pipeline/_parser_internals/arc_eager.pyx b/spacy/pipeline/_parser_internals/arc_eager.pyx
index 029e2e29e..d60f1c3e6 100644
--- a/spacy/pipeline/_parser_internals/arc_eager.pyx
+++ b/spacy/pipeline/_parser_internals/arc_eager.pyx
@@ -218,7 +218,7 @@ def _get_aligned_sent_starts(example):
sent_starts = [False] * len(example.x)
seen_words = set()
for y_sent in example.y.sents:
- x_indices = list(align[y_sent.start : y_sent.end].dataXd)
+ x_indices = list(align[y_sent.start : y_sent.end])
if any(x_idx in seen_words for x_idx in x_indices):
# If there are any tokens in X that align across two sentences,
# regard the sentence annotations as missing, as we can't
@@ -824,7 +824,7 @@ cdef class ArcEager(TransitionSystem):
for i in range(self.n_moves):
print(self.get_class_name(i), is_valid[i], costs[i])
print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1)))
- raise ValueError("Could not find gold transition - see logs above.")
+ raise ValueError(Errors.E1031)
def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None):
cdef int i
diff --git a/spacy/pipeline/_parser_internals/nonproj.pyx b/spacy/pipeline/_parser_internals/nonproj.pyx
index 82070cd27..36163fcc3 100644
--- a/spacy/pipeline/_parser_internals/nonproj.pyx
+++ b/spacy/pipeline/_parser_internals/nonproj.pyx
@@ -4,6 +4,10 @@ for doing pseudo-projective parsing implementation uses the HEAD decoration
scheme.
"""
from copy import copy
+from libc.limits cimport INT_MAX
+from libc.stdlib cimport abs
+from libcpp cimport bool
+from libcpp.vector cimport vector
from ...tokens.doc cimport Doc, set_children_from_heads
@@ -41,13 +45,18 @@ def contains_cycle(heads):
def is_nonproj_arc(tokenid, heads):
+ cdef vector[int] c_heads = _heads_to_c(heads)
+ return _is_nonproj_arc(tokenid, c_heads)
+
+
+cdef bool _is_nonproj_arc(int tokenid, const vector[int]& heads) nogil:
# definition (e.g. Havelka 2007): an arc h -> d, h < d is non-projective
# if there is a token k, h < k < d such that h is not
# an ancestor of k. Same for h -> d, h > d
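+ # e.g. with heads = [2, 0, 2, 1], the arc 1 -> 3 spans token 2, whose only ancestor is the root 2, so that arc is non-projective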
head = heads[tokenid]
if head == tokenid: # root arcs cannot be non-projective
return False
- elif head is None: # unattached tokens cannot be non-projective
+ elif head < 0: # unattached tokens cannot be non-projective
return False
cdef int start, end
@@ -56,19 +65,29 @@ def is_nonproj_arc(tokenid, heads):
else:
start, end = (tokenid+1, head)
for k in range(start, end):
- for ancestor in ancestors(k, heads):
- if ancestor is None: # for unattached tokens/subtrees
- break
- elif ancestor == head: # normal case: k dominated by h
- break
+ if _has_head_as_ancestor(k, head, heads):
+ continue
else: # head not in ancestors: d -> h is non-projective
return True
return False
+cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads) nogil:
+ ancestor = tokenid
+ cnt = 0
+ while cnt < heads.size():
+ if heads[ancestor] == head or heads[ancestor] < 0:
+ return True
+ ancestor = heads[ancestor]
+ cnt += 1
+
+ return False
+
+
def is_nonproj_tree(heads):
+ cdef vector[int] c_heads = _heads_to_c(heads)
# a tree is non-projective if at least one arc is non-projective
- return any(is_nonproj_arc(word, heads) for word in range(len(heads)))
+ return any(_is_nonproj_arc(word, c_heads) for word in range(len(heads)))
def decompose(label):
@@ -98,16 +117,31 @@ def projectivize(heads, labels):
# tree, i.e. connected and cycle-free. Returns a new pair (heads, labels)
# which encode a projective and decorated tree.
proj_heads = copy(heads)
- smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
- if smallest_np_arc is None: # this sentence is already projective
+
+ cdef int new_head
+ cdef vector[int] c_proj_heads = _heads_to_c(proj_heads)
+ cdef int smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
+ if smallest_np_arc == -1: # this sentence is already projective
return proj_heads, copy(labels)
- while smallest_np_arc is not None:
- _lift(smallest_np_arc, proj_heads)
- smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
+ while smallest_np_arc != -1:
+ new_head = _lift(smallest_np_arc, proj_heads)
+ c_proj_heads[smallest_np_arc] = new_head
+ smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
deco_labels = _decorate(heads, proj_heads, labels)
return proj_heads, deco_labels
+cdef vector[int] _heads_to_c(heads):
+ cdef vector[int] c_heads;
+ for head in heads:
+ if head is None:
+ c_heads.push_back(-1)
+ else:
+ assert head < len(heads)
+ c_heads.push_back(head)
+ return c_heads
+
+
cpdef deprojectivize(Doc doc):
# Reattach arcs with decorated labels (following HEAD scheme). For each
# decorated arc X||Y, search top-down, left-to-right, breadth-first until
@@ -137,27 +171,38 @@ def _decorate(heads, proj_heads, labels):
deco_labels.append(labels[tokenid])
return deco_labels
+def get_smallest_nonproj_arc_slow(heads):
+ cdef vector[int] c_heads = _heads_to_c(heads)
+ return _get_smallest_nonproj_arc(c_heads)
-def _get_smallest_nonproj_arc(heads):
+
+cdef int _get_smallest_nonproj_arc(const vector[int]& heads) nogil:
# return the smallest non-proj arc or None
# where size is defined as the distance between dep and head
# and ties are broken left to right
- smallest_size = float('inf')
- smallest_np_arc = None
- for tokenid, head in enumerate(heads):
+ cdef int smallest_size = INT_MAX
+ cdef int smallest_np_arc = -1
+ cdef int size
+ cdef int tokenid
+ cdef int head
+
+ for tokenid in range(heads.size()):
+ head = heads[tokenid]
size = abs(tokenid-head)
- if size < smallest_size and is_nonproj_arc(tokenid, heads):
+ if size < smallest_size and _is_nonproj_arc(tokenid, heads):
smallest_size = size
smallest_np_arc = tokenid
return smallest_np_arc
-def _lift(tokenid, heads):
+cpdef int _lift(tokenid, heads):
+ # reattaches a word to its grandfather
head = heads[tokenid]
ghead = heads[head]
+ cdef int new_head = ghead if head != ghead else tokenid
# attach to ghead if head isn't attached to root else attach to root
- heads[tokenid] = ghead if head != ghead else tokenid
+ heads[tokenid] = new_head
+ return new_head
def _find_new_head(token, headlabel):
diff --git a/spacy/pipeline/edit_tree_lemmatizer.py b/spacy/pipeline/edit_tree_lemmatizer.py
new file mode 100644
index 000000000..54a7030dc
--- /dev/null
+++ b/spacy/pipeline/edit_tree_lemmatizer.py
@@ -0,0 +1,379 @@
+from typing import cast, Any, Callable, Dict, Iterable, List, Optional
+from typing import Sequence, Tuple, Union
+from collections import Counter
+from copy import deepcopy
+from itertools import islice
+import numpy as np
+
+import srsly
+from thinc.api import Config, Model, SequenceCategoricalCrossentropy
+from thinc.types import Floats2d, Ints1d, Ints2d
+
+from ._edit_tree_internals.edit_trees import EditTrees
+from ._edit_tree_internals.schemas import validate_edit_tree
+from .lemmatizer import lemmatizer_score
+from .trainable_pipe import TrainablePipe
+from ..errors import Errors
+from ..language import Language
+from ..tokens import Doc
+from ..training import Example, validate_examples, validate_get_examples
+from ..vocab import Vocab
+from .. import util
+
+
+default_model_config = """
+[model]
+@architectures = "spacy.Tagger.v2"
+
+[model.tok2vec]
+@architectures = "spacy.HashEmbedCNN.v2"
+pretrained_vectors = null
+width = 96
+depth = 4
+embed_size = 2000
+window_size = 1
+maxout_pieces = 3
+subword_features = true
+"""
+DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
+
+
+@Language.factory(
+ "trainable_lemmatizer",
+ assigns=["token.lemma"],
+ requires=[],
+ default_config={
+ "model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
+ "backoff": "orth",
+ "min_tree_freq": 3,
+ "overwrite": False,
+ "top_k": 1,
+ "scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
+ },
+ default_score_weights={"lemma_acc": 1.0},
+)
+def make_edit_tree_lemmatizer(
+ nlp: Language,
+ name: str,
+ model: Model,
+ backoff: Optional[str],
+ min_tree_freq: int,
+ overwrite: bool,
+ top_k: int,
+ scorer: Optional[Callable],
+):
+ """Construct an EditTreeLemmatizer component."""
+ return EditTreeLemmatizer(
+ nlp.vocab,
+ model,
+ name,
+ backoff=backoff,
+ min_tree_freq=min_tree_freq,
+ overwrite=overwrite,
+ top_k=top_k,
+ scorer=scorer,
+ )
+
+
+class EditTreeLemmatizer(TrainablePipe):
+ """
+ Lemmatizer that lemmatizes each word using a predicted edit tree.
+ """
+
+ def __init__(
+ self,
+ vocab: Vocab,
+ model: Model,
+ name: str = "trainable_lemmatizer",
+ *,
+ backoff: Optional[str] = "orth",
+ min_tree_freq: int = 3,
+ overwrite: bool = False,
+ top_k: int = 1,
+ scorer: Optional[Callable] = lemmatizer_score,
+ ):
+ """
+ Construct an edit tree lemmatizer.
+
+ backoff (Optional[str]): backoff to use when the predicted edit trees
+ are not applicable. Must be an attribute of Token or None (leave the
+ lemma unset).
+ min_tree_freq (int): prune trees that are applied less than this
+ frequency in the training data.
+ overwrite (bool): overwrite existing lemma annotations.
+ top_k (int): try to apply at most the k most probable edit trees.
+ """
+ self.vocab = vocab
+ self.model = model
+ self.name = name
+ self.backoff = backoff
+ self.min_tree_freq = min_tree_freq
+ self.overwrite = overwrite
+ self.top_k = top_k
+
+ self.trees = EditTrees(self.vocab.strings)
+ self.tree2label: Dict[int, int] = {}
+
+ self.cfg: Dict[str, Any] = {"labels": []}
+ self.scorer = scorer
+
+ def get_loss(
+ self, examples: Iterable[Example], scores: List[Floats2d]
+ ) -> Tuple[float, List[Floats2d]]:
+ validate_examples(examples, "EditTreeLemmatizer.get_loss")
+ loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
+
+ truths = []
+ for eg in examples:
+ eg_truths = []
+ for (predicted, gold_lemma) in zip(
+ eg.predicted, eg.get_aligned("LEMMA", as_string=True)
+ ):
+ if gold_lemma is None:
+ label = -1
+ else:
+ tree_id = self.trees.add(predicted.text, gold_lemma)
+ label = self.tree2label.get(tree_id, 0)
+ eg_truths.append(label)
+
+ truths.append(eg_truths)
+
+ d_scores, loss = loss_func(scores, truths) # type: ignore
+ if self.model.ops.xp.isnan(loss):
+ raise ValueError(Errors.E910.format(name=self.name))
+
+ return float(loss), d_scores
+
+ def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
+ n_docs = len(list(docs))
+ if not any(len(doc) for doc in docs):
+ # Handle cases where there are no tokens in any docs.
+ n_labels = len(self.cfg["labels"])
+ guesses: List[Ints2d] = [
+ self.model.ops.alloc((0, n_labels), dtype="i") for doc in docs
+ ]
+ assert len(guesses) == n_docs
+ return guesses
+ scores = self.model.predict(docs)
+ assert len(scores) == n_docs
+ guesses = self._scores2guesses(docs, scores)
+ assert len(guesses) == n_docs
+ return guesses
+
+ def _scores2guesses(self, docs, scores):
+ guesses = []
+ for doc, doc_scores in zip(docs, scores):
+ if self.top_k == 1:
+ doc_guesses = doc_scores.argmax(axis=1).reshape(-1, 1)
+ else:
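+ # np.argsort sorts in ascending order, so the reversed slice selects the top_k highest-scoring tree labels for each token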
+ doc_guesses = np.argsort(doc_scores)[..., : -self.top_k - 1 : -1]
+
+ if not isinstance(doc_guesses, np.ndarray):
+ doc_guesses = doc_guesses.get()
+
+ doc_compat_guesses = []
+ for token, candidates in zip(doc, doc_guesses):
+ tree_id = -1
+ for candidate in candidates:
+ candidate_tree_id = self.cfg["labels"][candidate]
+
+ if self.trees.apply(candidate_tree_id, token.text) is not None:
+ tree_id = candidate_tree_id
+ break
+ doc_compat_guesses.append(tree_id)
+
+ guesses.append(np.array(doc_compat_guesses))
+
+ return guesses
+
+ def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
+ for i, doc in enumerate(docs):
+ doc_tree_ids = batch_tree_ids[i]
+ if hasattr(doc_tree_ids, "get"):
+ doc_tree_ids = doc_tree_ids.get()
+ for j, tree_id in enumerate(doc_tree_ids):
+ if self.overwrite or doc[j].lemma == 0:
+ # If no applicable tree could be found during prediction,
+ # the special identifier -1 is used. Otherwise the tree
+ # is guaranteed to be applicable.
+ if tree_id == -1:
+ if self.backoff is not None:
+ doc[j].lemma = getattr(doc[j], self.backoff)
+ else:
+ lemma = self.trees.apply(tree_id, doc[j].text)
+ doc[j].lemma_ = lemma
+
+ @property
+ def labels(self) -> Tuple[int, ...]:
+ """Returns the labels currently added to the component."""
+ return tuple(self.cfg["labels"])
+
+ @property
+ def hide_labels(self) -> bool:
+ return True
+
+ @property
+ def label_data(self) -> Dict:
+ trees = []
+ for tree_id in range(len(self.trees)):
+ tree = self.trees[tree_id]
+ if "orig" in tree:
+ tree["orig"] = self.vocab.strings[tree["orig"]]
+ if "subst" in tree:
+ tree["subst"] = self.vocab.strings[tree["subst"]]
+ trees.append(tree)
+ return dict(trees=trees, labels=tuple(self.cfg["labels"]))
+
+ def initialize(
+ self,
+ get_examples: Callable[[], Iterable[Example]],
+ *,
+ nlp: Optional[Language] = None,
+ labels: Optional[Dict] = None,
+ ):
+ validate_get_examples(get_examples, "EditTreeLemmatizer.initialize")
+
+ if labels is None:
+ self._labels_from_data(get_examples)
+ else:
+ self._add_labels(labels)
+
+ # Sample for the model.
+ doc_sample = []
+ label_sample = []
+ for example in islice(get_examples(), 10):
+ doc_sample.append(example.x)
+ gold_labels: List[List[float]] = []
+ for token in example.reference:
+ if token.lemma == 0:
+ gold_label = None
+ else:
+ gold_label = self._pair2label(token.text, token.lemma_)
+
+ gold_labels.append(
+ [
+ 1.0 if label == gold_label else 0.0
+ for label in self.cfg["labels"]
+ ]
+ )
+
+ gold_labels = cast(Floats2d, gold_labels)
+ label_sample.append(self.model.ops.asarray(gold_labels, dtype="float32"))
+
+ self._require_labels()
+ assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
+ assert len(label_sample) > 0, Errors.E923.format(name=self.name)
+
+ self.model.initialize(X=doc_sample, Y=label_sample)
+
+ def from_bytes(self, bytes_data, *, exclude=tuple()):
+ deserializers = {
+ "cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
+ "model": lambda b: self.model.from_bytes(b),
+ "vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
+ "trees": lambda b: self.trees.from_bytes(b),
+ }
+
+ util.from_bytes(bytes_data, deserializers, exclude)
+
+ return self
+
+ def to_bytes(self, *, exclude=tuple()):
+ serializers = {
+ "cfg": lambda: srsly.json_dumps(self.cfg),
+ "model": lambda: self.model.to_bytes(),
+ "vocab": lambda: self.vocab.to_bytes(exclude=exclude),
+ "trees": lambda: self.trees.to_bytes(),
+ }
+
+ return util.to_bytes(serializers, exclude)
+
+ def to_disk(self, path, exclude=tuple()):
+ path = util.ensure_path(path)
+ serializers = {
+ "cfg": lambda p: srsly.write_json(p, self.cfg),
+ "model": lambda p: self.model.to_disk(p),
+ "vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
+ "trees": lambda p: self.trees.to_disk(p),
+ }
+ util.to_disk(path, serializers, exclude)
+
+ def from_disk(self, path, exclude=tuple()):
+ def load_model(p):
+ try:
+ with open(p, "rb") as mfile:
+ self.model.from_bytes(mfile.read())
+ except AttributeError:
+ raise ValueError(Errors.E149) from None
+
+ deserializers = {
+ "cfg": lambda p: self.cfg.update(srsly.read_json(p)),
+ "model": load_model,
+ "vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
+ "trees": lambda p: self.trees.from_disk(p),
+ }
+
+ util.from_disk(path, deserializers, exclude)
+ return self
+
+ def _add_labels(self, labels: Dict):
+ if "labels" not in labels:
+ raise ValueError(Errors.E857.format(name="labels"))
+ if "trees" not in labels:
+ raise ValueError(Errors.E857.format(name="trees"))
+
+ self.cfg["labels"] = list(labels["labels"])
+ trees = []
+ for tree in labels["trees"]:
+ errors = validate_edit_tree(tree)
+ if errors:
+ raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
+
+ tree = dict(tree)
+ if "orig" in tree:
+ tree["orig"] = self.vocab.strings[tree["orig"]]
+ if "orig" in tree:
+ tree["subst"] = self.vocab.strings[tree["subst"]]
+
+ trees.append(tree)
+
+ self.trees.from_json(trees)
+
+ for label, tree in enumerate(self.labels):
+ self.tree2label[tree] = label
+
+ def _labels_from_data(self, get_examples: Callable[[], Iterable[Example]]):
+ # Count corpus tree frequencies in ad-hoc storage to avoid cluttering
+ # the final pipe/string store.
+ vocab = Vocab()
+ trees = EditTrees(vocab.strings)
+ tree_freqs: Counter = Counter()
+ repr_pairs: Dict = {}
+ for example in get_examples():
+ for token in example.reference:
+ if token.lemma != 0:
+ tree_id = trees.add(token.text, token.lemma_)
+ tree_freqs[tree_id] += 1
+ repr_pairs[tree_id] = (token.text, token.lemma_)
+
+ # Construct trees that make the frequency cut-off using representative
+ # form - token pairs.
+ for tree_id, freq in tree_freqs.items():
+ if freq >= self.min_tree_freq:
+ form, lemma = repr_pairs[tree_id]
+ self._pair2label(form, lemma, add_label=True)
+
+ def _pair2label(self, form, lemma, add_label=False):
+ """
+ Look up the edit tree identifier for a form/lemma pair. If the edit
+ tree is unknown and "add_label" is set, the edit tree will be added to
+ the labels.
+ """
+ tree_id = self.trees.add(form, lemma)
+ if tree_id not in self.tree2label:
+ if not add_label:
+ return None
+
+ self.tree2label[tree_id] = len(self.cfg["labels"])
+ self.cfg["labels"].append(tree_id)
+ return self.tree2label[tree_id]
diff --git a/spacy/pipeline/entity_linker.py b/spacy/pipeline/entity_linker.py
index 1169e898d..89e7576bf 100644
--- a/spacy/pipeline/entity_linker.py
+++ b/spacy/pipeline/entity_linker.py
@@ -6,17 +6,17 @@ import srsly
import random
from thinc.api import CosineDistance, Model, Optimizer, Config
from thinc.api import set_dropout_rate
-import warnings
from ..kb import KnowledgeBase, Candidate
from ..ml import empty_kb
from ..tokens import Doc, Span
from .pipe import deserialize_config
+from .legacy.entity_linker import EntityLinker_v1
from .trainable_pipe import TrainablePipe
from ..language import Language
from ..vocab import Vocab
from ..training import Example, validate_examples, validate_get_examples
-from ..errors import Errors, Warnings
+from ..errors import Errors
from ..util import SimpleFrozenList, registry
from .. import util
from ..scorer import Scorer
@@ -26,7 +26,7 @@ BACKWARD_OVERWRITE = True
default_model_config = """
[model]
-@architectures = "spacy.EntityLinker.v1"
+@architectures = "spacy.EntityLinker.v2"
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
@@ -55,6 +55,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
+ "use_gold_ents": True,
},
default_score_weights={
"nel_micro_f": 1.0,
@@ -75,6 +76,7 @@ def make_entity_linker(
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool,
scorer: Optional[Callable],
+ use_gold_ents: bool,
):
"""Construct an EntityLinker component.
@@ -90,6 +92,22 @@ def make_entity_linker(
produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method.
"""
+
+ if not model.attrs.get("include_span_maker", False):
+ # The only difference in arguments here is that use_gold_ents is not available
+ return EntityLinker_v1(
+ nlp.vocab,
+ model,
+ name,
+ labels_discard=labels_discard,
+ n_sents=n_sents,
+ incl_prior=incl_prior,
+ incl_context=incl_context,
+ entity_vector_length=entity_vector_length,
+ get_candidates=get_candidates,
+ overwrite=overwrite,
+ scorer=scorer,
+ )
return EntityLinker(
nlp.vocab,
model,
@@ -102,6 +120,7 @@ def make_entity_linker(
get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
+ use_gold_ents=use_gold_ents,
)
@@ -136,6 +155,7 @@ class EntityLinker(TrainablePipe):
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool = BACKWARD_OVERWRITE,
scorer: Optional[Callable] = entity_linker_score,
+ use_gold_ents: bool,
) -> None:
"""Initialize an entity linker.
@@ -152,6 +172,8 @@ class EntityLinker(TrainablePipe):
produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_links.
+ use_gold_ents (bool): Whether to copy entities from gold docs or not. If false, another
+ component must provide entity annotations.
DOCS: https://spacy.io/api/entitylinker#init
"""
@@ -169,6 +191,7 @@ class EntityLinker(TrainablePipe):
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab)
self.scorer = scorer
+ self.use_gold_ents = use_gold_ents
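+        # NB: when use_gold_ents is False, a preceding pipeline component is
+        # expected to set doc.ents before this component is updated or run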
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will
@@ -212,14 +235,48 @@ class EntityLinker(TrainablePipe):
doc_sample = []
vector_sample = []
for example in islice(get_examples(), 10):
- doc_sample.append(example.x)
+ doc = example.x
+ if self.use_gold_ents:
+ doc.ents = example.y.ents
+ doc_sample.append(doc)
vector_sample.append(self.model.ops.alloc1f(nO))
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
+
+ # XXX In order for size estimation to work, there has to be at least
+ # one entity. It's not used for training so it doesn't have to be real,
+ # so we add a fake one if none are present.
+ # We can't use Doc.has_annotation here because it can be True for docs
+ # that have been through an NER component but got no entities.
+ has_annotations = any([doc.ents for doc in doc_sample])
+ if not has_annotations:
+ doc = doc_sample[0]
+ ent = doc[0:1]
+ ent.label_ = "XXX"
+ doc.ents = (ent,)
+
self.model.initialize(
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
)
+ if not has_annotations:
+ # Clean up dummy annotation
+ doc.ents = []
+
+ def batch_has_learnable_example(self, examples):
+ """Check if a batch contains a learnable example.
+
+ If one isn't present, then the update step needs to be skipped.
+ """
+
+ for eg in examples:
+ for ent in eg.predicted.ents:
+ candidates = list(self.get_candidates(self.kb, ent))
+ if candidates:
+ return True
+
+ return False
+
def update(
self,
examples: Iterable[Example],
@@ -247,35 +304,29 @@ class EntityLinker(TrainablePipe):
if not examples:
return losses
validate_examples(examples, "EntityLinker.update")
- sentence_docs = []
- for eg in examples:
- sentences = [s for s in eg.reference.sents]
- kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
- for ent in eg.reference.ents:
- # KB ID of the first token is the same as the whole span
- kb_id = kb_ids[ent.start]
- if kb_id:
- try:
- # find the sentence in the list of sentences.
- sent_index = sentences.index(ent.sent)
- except AttributeError:
- # Catch the exception when ent.sent is None and provide a user-friendly warning
- raise RuntimeError(Errors.E030) from None
- # get n previous sentences, if there are any
- start_sentence = max(0, sent_index - self.n_sents)
- # get n posterior sentences, or as many < n as there are
- end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
- # get token positions
- start_token = sentences[start_sentence].start
- end_token = sentences[end_sentence].end
- # append that span as a doc to training
- sent_doc = eg.predicted[start_token:end_token].as_doc()
- sentence_docs.append(sent_doc)
+
set_dropout_rate(self.model, drop)
- if not sentence_docs:
- warnings.warn(Warnings.W093.format(name="Entity Linker"))
+ docs = [eg.predicted for eg in examples]
+ # save to restore later
+ old_ents = [doc.ents for doc in docs]
+
+ for doc, ex in zip(docs, examples):
+ if self.use_gold_ents:
+ doc.ents = ex.reference.ents
+ else:
+ # only keep matching ents
+ doc.ents = ex.get_matching_ents()
+
+ # make sure we have something to learn from, if not, short-circuit
+ if not self.batch_has_learnable_example(examples):
return losses
- sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
+
+ sentence_encodings, bp_context = self.model.begin_update(docs)
+
+ # now restore the ents
+ for doc, old in zip(docs, old_ents):
+ doc.ents = old
+
loss, d_scores = self.get_loss(
sentence_encodings=sentence_encodings, examples=examples
)
@@ -288,24 +339,38 @@ class EntityLinker(TrainablePipe):
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
validate_examples(examples, "EntityLinker.get_loss")
entity_encodings = []
+        eidx = 0  # running index over the gold entities
+ keep_ents = [] # indices in sentence_encodings to keep
+
for eg in examples:
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
+
for ent in eg.reference.ents:
kb_id = kb_ids[ent.start]
if kb_id:
entity_encoding = self.kb.get_vector(kb_id)
entity_encodings.append(entity_encoding)
+ keep_ents.append(eidx)
+
+ eidx += 1
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
- if sentence_encodings.shape != entity_encodings.shape:
+ selected_encodings = sentence_encodings[keep_ents]
+
+        # If the entity encodings list is empty, or the gold entities
+        # otherwise don't line up with the selected sentence encodings,
+        # raise an informative error.
+ if selected_encodings.shape != entity_encodings.shape:
err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up"
)
raise RuntimeError(err)
# TODO: fix typing issue here
- gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore
- loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
+ gradients = self.distance.get_grad(selected_encodings, entity_encodings) # type: ignore
+ # to match the input size, we need to give a zero gradient for items not in the kb
+ out = self.model.ops.alloc2f(*sentence_encodings.shape)
+ out[keep_ents] = gradients
+
+ loss = self.distance.get_loss(selected_encodings, entity_encodings) # type: ignore
loss = loss / len(entity_encodings)
- return float(loss), gradients
+ return float(loss), out
def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them.
diff --git a/spacy/pipeline/legacy/__init__.py b/spacy/pipeline/legacy/__init__.py
new file mode 100644
index 000000000..f216840dc
--- /dev/null
+++ b/spacy/pipeline/legacy/__init__.py
@@ -0,0 +1,3 @@
+from .entity_linker import EntityLinker_v1
+
+__all__ = ["EntityLinker_v1"]
diff --git a/spacy/pipeline/legacy/entity_linker.py b/spacy/pipeline/legacy/entity_linker.py
new file mode 100644
index 000000000..6440c18e5
--- /dev/null
+++ b/spacy/pipeline/legacy/entity_linker.py
@@ -0,0 +1,427 @@
+# This file is present to provide a prior version of the EntityLinker component
+# for backwards compatibility. For details see #9669.
+
+from typing import Optional, Iterable, Callable, Dict, Union, List, Any
+from thinc.types import Floats2d
+from pathlib import Path
+from itertools import islice
+import srsly
+import random
+from thinc.api import CosineDistance, Model, Optimizer, Config
+from thinc.api import set_dropout_rate
+import warnings
+
+from ...kb import KnowledgeBase, Candidate
+from ...ml import empty_kb
+from ...tokens import Doc, Span
+from ..pipe import deserialize_config
+from ..trainable_pipe import TrainablePipe
+from ...language import Language
+from ...vocab import Vocab
+from ...training import Example, validate_examples, validate_get_examples
+from ...errors import Errors, Warnings
+from ...util import SimpleFrozenList, registry
+from ... import util
+from ...scorer import Scorer
+
+# See #9050
+BACKWARD_OVERWRITE = True
+
+
+def entity_linker_score(examples, **kwargs):
+ return Scorer.score_links(examples, negative_labels=[EntityLinker_v1.NIL], **kwargs)
+
+
+class EntityLinker_v1(TrainablePipe):
+ """Pipeline component for named entity linking.
+
+ DOCS: https://spacy.io/api/entitylinker
+ """
+
+ NIL = "NIL" # string used to refer to a non-existing link
+
+ def __init__(
+ self,
+ vocab: Vocab,
+ model: Model,
+ name: str = "entity_linker",
+ *,
+ labels_discard: Iterable[str],
+ n_sents: int,
+ incl_prior: bool,
+ incl_context: bool,
+ entity_vector_length: int,
+ get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
+ overwrite: bool = BACKWARD_OVERWRITE,
+ scorer: Optional[Callable] = entity_linker_score,
+ ) -> None:
+ """Initialize an entity linker.
+
+ vocab (Vocab): The shared vocabulary.
+ model (thinc.api.Model): The Thinc Model powering the pipeline component.
+ name (str): The component instance name, used to add entries to the
+ losses during training.
+ labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
+ n_sents (int): The number of neighbouring sentences to take into account.
+ incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
+ incl_context (bool): Whether or not to include the local context in the model.
+ entity_vector_length (int): Size of encoding vectors in the KB.
+ get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
+ produces a list of candidates, given a certain knowledge base and a textual mention.
+ scorer (Optional[Callable]): The scoring method. Defaults to
+ Scorer.score_links.
+
+ DOCS: https://spacy.io/api/entitylinker#init
+ """
+ self.vocab = vocab
+ self.model = model
+ self.name = name
+ self.labels_discard = list(labels_discard)
+ self.n_sents = n_sents
+ self.incl_prior = incl_prior
+ self.incl_context = incl_context
+ self.get_candidates = get_candidates
+ self.cfg: Dict[str, Any] = {"overwrite": overwrite}
+ self.distance = CosineDistance(normalize=False)
+ # how many neighbour sentences to take into account
+ # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
+ self.kb = empty_kb(entity_vector_length)(self.vocab)
+ self.scorer = scorer
+
+ def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
+ """Define the KB of this pipe by providing a function that will
+ create it using this object's vocab."""
+ if not callable(kb_loader):
+ raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
+
+ self.kb = kb_loader(self.vocab)
+
+ def validate_kb(self) -> None:
+ # Raise an error if the knowledge base is not initialized.
+ if self.kb is None:
+ raise ValueError(Errors.E1018.format(name=self.name))
+ if len(self.kb) == 0:
+ raise ValueError(Errors.E139.format(name=self.name))
+
+ def initialize(
+ self,
+ get_examples: Callable[[], Iterable[Example]],
+ *,
+ nlp: Optional[Language] = None,
+ kb_loader: Optional[Callable[[Vocab], KnowledgeBase]] = None,
+ ):
+ """Initialize the pipe for training, using a representative set
+ of data examples.
+
+ get_examples (Callable[[], Iterable[Example]]): Function that
+ returns a representative sample of gold-standard Example objects.
+ nlp (Language): The current nlp object the component is part of.
+ kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates a KnowledgeBase from a Vocab instance.
+            Note that providing this argument will overwrite all data accumulated in the current KB.
+            Use this only when loading a KB as-is from file.
+
+ DOCS: https://spacy.io/api/entitylinker#initialize
+ """
+ validate_get_examples(get_examples, "EntityLinker_v1.initialize")
+ if kb_loader is not None:
+ self.set_kb(kb_loader)
+ self.validate_kb()
+ nO = self.kb.entity_vector_length
+ doc_sample = []
+ vector_sample = []
+ for example in islice(get_examples(), 10):
+ doc_sample.append(example.x)
+ vector_sample.append(self.model.ops.alloc1f(nO))
+ assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
+ assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
+ self.model.initialize(
+ X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
+ )
+
+ def update(
+ self,
+ examples: Iterable[Example],
+ *,
+ drop: float = 0.0,
+ sgd: Optional[Optimizer] = None,
+ losses: Optional[Dict[str, float]] = None,
+ ) -> Dict[str, float]:
+ """Learn from a batch of documents and gold-standard information,
+ updating the pipe's model. Delegates to predict and get_loss.
+
+ examples (Iterable[Example]): A batch of Example objects.
+ drop (float): The dropout rate.
+ sgd (thinc.api.Optimizer): The optimizer.
+ losses (Dict[str, float]): Optional record of the loss during training.
+ Updated using the component name as the key.
+ RETURNS (Dict[str, float]): The updated losses dictionary.
+
+ DOCS: https://spacy.io/api/entitylinker#update
+ """
+ self.validate_kb()
+ if losses is None:
+ losses = {}
+ losses.setdefault(self.name, 0.0)
+ if not examples:
+ return losses
+ validate_examples(examples, "EntityLinker_v1.update")
+ sentence_docs = []
+ for eg in examples:
+ sentences = [s for s in eg.reference.sents]
+ kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
+ for ent in eg.reference.ents:
+ # KB ID of the first token is the same as the whole span
+ kb_id = kb_ids[ent.start]
+ if kb_id:
+ try:
+ # find the sentence in the list of sentences.
+ sent_index = sentences.index(ent.sent)
+ except AttributeError:
+ # Catch the exception when ent.sent is None and provide a user-friendly warning
+ raise RuntimeError(Errors.E030) from None
+ # get n previous sentences, if there are any
+ start_sentence = max(0, sent_index - self.n_sents)
+ # get n posterior sentences, or as many < n as there are
+ end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
+ # get token positions
+ start_token = sentences[start_sentence].start
+ end_token = sentences[end_sentence].end
+ # append that span as a doc to training
+ sent_doc = eg.predicted[start_token:end_token].as_doc()
+ sentence_docs.append(sent_doc)
+ set_dropout_rate(self.model, drop)
+ if not sentence_docs:
+ warnings.warn(Warnings.W093.format(name="Entity Linker"))
+ return losses
+ sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
+ loss, d_scores = self.get_loss(
+ sentence_encodings=sentence_encodings, examples=examples
+ )
+ bp_context(d_scores)
+ if sgd is not None:
+ self.finish_update(sgd)
+ losses[self.name] += loss
+ return losses
+
+ def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
+ validate_examples(examples, "EntityLinker_v1.get_loss")
+ entity_encodings = []
+ for eg in examples:
+ kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
+ for ent in eg.reference.ents:
+ kb_id = kb_ids[ent.start]
+ if kb_id:
+ entity_encoding = self.kb.get_vector(kb_id)
+ entity_encodings.append(entity_encoding)
+ entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
+ if sentence_encodings.shape != entity_encodings.shape:
+ err = Errors.E147.format(
+ method="get_loss", msg="gold entities do not match up"
+ )
+ raise RuntimeError(err)
+ # TODO: fix typing issue here
+ gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore
+ loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
+ loss = loss / len(entity_encodings)
+ return float(loss), gradients
+
+ def predict(self, docs: Iterable[Doc]) -> List[str]:
+ """Apply the pipeline's model to a batch of docs, without modifying them.
+ Returns the KB IDs for each entity in each doc, including NIL if there is
+ no prediction.
+
+ docs (Iterable[Doc]): The documents to predict.
+        RETURNS (List[str]): The model's prediction for each document.
+
+ DOCS: https://spacy.io/api/entitylinker#predict
+ """
+ self.validate_kb()
+ entity_count = 0
+ final_kb_ids: List[str] = []
+ if not docs:
+ return final_kb_ids
+ if isinstance(docs, Doc):
+ docs = [docs]
+ for i, doc in enumerate(docs):
+ sentences = [s for s in doc.sents]
+ if len(doc) > 0:
+ # Looping through each entity (TODO: rewrite)
+ for ent in doc.ents:
+ sent = ent.sent
+ sent_index = sentences.index(sent)
+ assert sent_index >= 0
+ # get n_neighbour sentences, clipped to the length of the document
+ start_sentence = max(0, sent_index - self.n_sents)
+ end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
+ start_token = sentences[start_sentence].start
+ end_token = sentences[end_sentence].end
+ sent_doc = doc[start_token:end_token].as_doc()
+ # currently, the context is the same for each entity in a sentence (should be refined)
+ xp = self.model.ops.xp
+ if self.incl_context:
+ sentence_encoding = self.model.predict([sent_doc])[0]
+ sentence_encoding_t = sentence_encoding.T
+ sentence_norm = xp.linalg.norm(sentence_encoding_t)
+ entity_count += 1
+ if ent.label_ in self.labels_discard:
+ # ignoring this entity - setting to NIL
+ final_kb_ids.append(self.NIL)
+ else:
+ candidates = list(self.get_candidates(self.kb, ent))
+ if not candidates:
+ # no prediction possible for this entity - setting to NIL
+ final_kb_ids.append(self.NIL)
+ elif len(candidates) == 1:
+ # shortcut for efficiency reasons: take the 1 candidate
+ # TODO: thresholding
+ final_kb_ids.append(candidates[0].entity_)
+ else:
+ random.shuffle(candidates)
+ # set all prior probabilities to 0 if incl_prior=False
+ prior_probs = xp.asarray([c.prior_prob for c in candidates])
+ if not self.incl_prior:
+ prior_probs = xp.asarray([0.0 for _ in candidates])
+ scores = prior_probs
+ # add in similarity from the context
+ if self.incl_context:
+ entity_encodings = xp.asarray(
+ [c.entity_vector for c in candidates]
+ )
+ entity_norm = xp.linalg.norm(entity_encodings, axis=1)
+ if len(entity_encodings) != len(prior_probs):
+ raise RuntimeError(
+ Errors.E147.format(
+ method="predict",
+ msg="vectors not of equal length",
+ )
+ )
+ # cosine similarity
+ sims = xp.dot(entity_encodings, sentence_encoding_t) / (
+ sentence_norm * entity_norm
+ )
+ if sims.shape != prior_probs.shape:
+ raise ValueError(Errors.E161)
+ scores = prior_probs + sims - (prior_probs * sims)
+ # TODO: thresholding
+ best_index = scores.argmax().item()
+ best_candidate = candidates[best_index]
+ final_kb_ids.append(best_candidate.entity_)
+ if not (len(final_kb_ids) == entity_count):
+ err = Errors.E147.format(
+ method="predict", msg="result variables not of equal length"
+ )
+ raise RuntimeError(err)
+ return final_kb_ids
+
+ def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
+ """Modify a batch of documents, using pre-computed scores.
+
+ docs (Iterable[Doc]): The documents to modify.
+ kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict.
+
+ DOCS: https://spacy.io/api/entitylinker#set_annotations
+ """
+ count_ents = len([ent for doc in docs for ent in doc.ents])
+ if count_ents != len(kb_ids):
+ raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
+ i = 0
+ overwrite = self.cfg["overwrite"]
+ for doc in docs:
+ for ent in doc.ents:
+ kb_id = kb_ids[i]
+ i += 1
+ for token in ent:
+ if token.ent_kb_id == 0 or overwrite:
+ token.ent_kb_id_ = kb_id
+
+ def to_bytes(self, *, exclude=tuple()):
+ """Serialize the pipe to a bytestring.
+
+ exclude (Iterable[str]): String names of serialization fields to exclude.
+ RETURNS (bytes): The serialized object.
+
+ DOCS: https://spacy.io/api/entitylinker#to_bytes
+ """
+ self._validate_serialization_attrs()
+ serialize = {}
+ if hasattr(self, "cfg") and self.cfg is not None:
+ serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
+ serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
+ serialize["kb"] = self.kb.to_bytes
+ serialize["model"] = self.model.to_bytes
+ return util.to_bytes(serialize, exclude)
+
+ def from_bytes(self, bytes_data, *, exclude=tuple()):
+ """Load the pipe from a bytestring.
+
+ exclude (Iterable[str]): String names of serialization fields to exclude.
+ RETURNS (TrainablePipe): The loaded object.
+
+ DOCS: https://spacy.io/api/entitylinker#from_bytes
+ """
+ self._validate_serialization_attrs()
+
+ def load_model(b):
+ try:
+ self.model.from_bytes(b)
+ except AttributeError:
+ raise ValueError(Errors.E149) from None
+
+ deserialize = {}
+ if hasattr(self, "cfg") and self.cfg is not None:
+ deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
+ deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
+ deserialize["kb"] = lambda b: self.kb.from_bytes(b)
+ deserialize["model"] = load_model
+ util.from_bytes(bytes_data, deserialize, exclude)
+ return self
+
+ def to_disk(
+ self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
+ ) -> None:
+ """Serialize the pipe to disk.
+
+ path (str / Path): Path to a directory.
+ exclude (Iterable[str]): String names of serialization fields to exclude.
+
+ DOCS: https://spacy.io/api/entitylinker#to_disk
+ """
+ serialize = {}
+ serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
+ serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
+ serialize["kb"] = lambda p: self.kb.to_disk(p)
+ serialize["model"] = lambda p: self.model.to_disk(p)
+ util.to_disk(path, serialize, exclude)
+
+ def from_disk(
+ self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
+ ) -> "EntityLinker_v1":
+ """Load the pipe from disk. Modifies the object in place and returns it.
+
+ path (str / Path): Path to a directory.
+ exclude (Iterable[str]): String names of serialization fields to exclude.
+ RETURNS (EntityLinker): The modified EntityLinker object.
+
+ DOCS: https://spacy.io/api/entitylinker#from_disk
+ """
+
+ def load_model(p):
+ try:
+ with p.open("rb") as infile:
+ self.model.from_bytes(infile.read())
+ except AttributeError:
+ raise ValueError(Errors.E149) from None
+
+ deserialize: Dict[str, Callable[[Any], Any]] = {}
+ deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
+ deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
+ deserialize["kb"] = lambda p: self.kb.from_disk(p)
+ deserialize["model"] = load_model
+ util.from_disk(path, deserialize, exclude)
+ return self
+
+ def rehearse(self, examples, *, sgd=None, losses=None, **config):
+ raise NotImplementedError
+
+ def add_label(self, label):
+ raise NotImplementedError
diff --git a/spacy/pipeline/morphologizer.pyx b/spacy/pipeline/morphologizer.pyx
index 73d3799b1..24f98508f 100644
--- a/spacy/pipeline/morphologizer.pyx
+++ b/spacy/pipeline/morphologizer.pyx
@@ -25,7 +25,7 @@ BACKWARD_EXTEND = False
default_model_config = """
[model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
diff --git a/spacy/pipeline/senter.pyx b/spacy/pipeline/senter.pyx
index 6d00e829d..6808fe70e 100644
--- a/spacy/pipeline/senter.pyx
+++ b/spacy/pipeline/senter.pyx
@@ -20,7 +20,7 @@ BACKWARD_OVERWRITE = False
default_model_config = """
[model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
diff --git a/spacy/pipeline/spancat.py b/spacy/pipeline/spancat.py
index f5522f2d3..0a6138fbc 100644
--- a/spacy/pipeline/spancat.py
+++ b/spacy/pipeline/spancat.py
@@ -272,6 +272,24 @@ class SpanCategorizer(TrainablePipe):
scores = self.model.predict((docs, indices)) # type: ignore
return indices, scores
+ def set_candidates(
+ self, docs: Iterable[Doc], *, candidates_key: str = "candidates"
+ ) -> None:
+ """Use the spancat suggester to add a list of span candidates to a list of docs.
+ This method is intended to be used for debugging purposes.
+
+ docs (Iterable[Doc]): The documents to modify.
+ candidates_key (str): Key of the Doc.spans dict to save the candidate spans under.
+
+ DOCS: https://spacy.io/api/spancategorizer#set_candidates
+ """
+ suggester_output = self.suggester(docs, ops=self.model.ops)
+
+ for candidates, doc in zip(suggester_output, docs): # type: ignore
+ doc.spans[candidates_key] = []
+ for index in candidates.dataXd:
+ doc.spans[candidates_key].append(doc[index[0] : index[1]])
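+
+        # Usage sketch (assumes a pipeline "nlp" that has this component
+        # registered under the name "spancat"; the names are illustrative):
+        #
+        #     spancat = nlp.get_pipe("spancat")
+        #     docs = [nlp.make_doc(t) for t in texts]
+        #     spancat.set_candidates(docs)
+        #     candidate_spans = docs[0].spans["candidates"]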
+
def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None:
"""Modify a batch of Doc objects, using pre-computed scores.
@@ -378,7 +396,7 @@ class SpanCategorizer(TrainablePipe):
# If the prediction is 0.9 and it's false, the gradient will be
# 0.9 (0.9 - 0.0)
d_scores = scores - target
- loss = float((d_scores ** 2).sum())
+ loss = float((d_scores**2).sum())
return loss, d_scores
def initialize(
diff --git a/spacy/pipeline/tagger.pyx b/spacy/pipeline/tagger.pyx
index a2bec888e..d6ecbf084 100644
--- a/spacy/pipeline/tagger.pyx
+++ b/spacy/pipeline/tagger.pyx
@@ -27,7 +27,7 @@ BACKWARD_OVERWRITE = False
default_model_config = """
[model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
@@ -225,6 +225,7 @@ class Tagger(TrainablePipe):
DOCS: https://spacy.io/api/tagger#rehearse
"""
+ loss_func = SequenceCategoricalCrossentropy()
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
@@ -236,12 +237,12 @@ class Tagger(TrainablePipe):
# Handle cases where there are no tokens in any docs.
return losses
set_dropout_rate(self.model, drop)
- guesses, backprop = self.model.begin_update(docs)
- target = self._rehearsal_model(examples)
- gradient = guesses - target
- backprop(gradient)
+ tag_scores, bp_tag_scores = self.model.begin_update(docs)
+ tutor_tag_scores, _ = self._rehearsal_model.begin_update(docs)
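+        # the rehearsal model (a copy of the pipe's original model) provides
+        # the teacher scores for the cross-entropy loss below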
+ grads, loss = loss_func(tag_scores, tutor_tag_scores)
+ bp_tag_scores(grads)
self.finish_update(sgd)
- losses[self.name] += (gradient**2).sum()
+ losses[self.name] += loss
return losses
def get_loss(self, examples, scores):
diff --git a/spacy/pipeline/textcat.py b/spacy/pipeline/textcat.py
index dd5fdc078..bc3f127fc 100644
--- a/spacy/pipeline/textcat.py
+++ b/spacy/pipeline/textcat.py
@@ -283,12 +283,12 @@ class TextCategorizer(TrainablePipe):
return losses
set_dropout_rate(self.model, drop)
scores, bp_scores = self.model.begin_update(docs)
- target = self._rehearsal_model(examples)
+ target, _ = self._rehearsal_model.begin_update(docs)
gradient = scores - target
bp_scores(gradient)
if sgd is not None:
self.finish_update(sgd)
- losses[self.name] += (gradient ** 2).sum()
+ losses[self.name] += (gradient**2).sum()
return losses
def _examples_to_truth(
@@ -320,9 +320,9 @@ class TextCategorizer(TrainablePipe):
self._validate_categories(examples)
truths, not_missing = self._examples_to_truth(examples)
not_missing = self.model.ops.asarray(not_missing) # type: ignore
- d_scores = (scores - truths)
+ d_scores = scores - truths
d_scores *= not_missing
- mean_square_error = (d_scores ** 2).mean()
+ mean_square_error = (d_scores**2).mean()
return float(mean_square_error), d_scores
def add_label(self, label: str) -> int:
diff --git a/spacy/pipeline/tok2vec.py b/spacy/pipeline/tok2vec.py
index cb601e5dc..2e3dde3cb 100644
--- a/spacy/pipeline/tok2vec.py
+++ b/spacy/pipeline/tok2vec.py
@@ -118,6 +118,10 @@ class Tok2Vec(TrainablePipe):
DOCS: https://spacy.io/api/tok2vec#predict
"""
+ if not any(len(doc) for doc in docs):
+ # Handle cases where there are no tokens in any docs.
+ width = self.model.get_dim("nO")
+ return [self.model.ops.alloc((0, width)) for doc in docs]
tokvecs = self.model.predict(docs)
batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners:
diff --git a/spacy/scorer.py b/spacy/scorer.py
index ae9338bd5..8cd755ac4 100644
--- a/spacy/scorer.py
+++ b/spacy/scorer.py
@@ -228,7 +228,7 @@ class Scorer:
if token.orth_.isspace():
continue
if align.x2y.lengths[token.i] == 1:
- gold_i = align.x2y[token.i].dataXd[0, 0]
+ gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices:
pred_tags.add((gold_i, getter(token, attr)))
tag_score.score_set(pred_tags, gold_tags)
@@ -287,7 +287,7 @@ class Scorer:
if token.orth_.isspace():
continue
if align.x2y.lengths[token.i] == 1:
- gold_i = align.x2y[token.i].dataXd[0, 0]
+ gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices:
value = getter(token, attr)
morph = gold_doc.vocab.strings[value]
@@ -694,13 +694,13 @@ class Scorer:
if align.x2y.lengths[token.i] != 1:
gold_i = None # type: ignore
else:
- gold_i = align.x2y[token.i].dataXd[0, 0]
+ gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices:
dep = getter(token, attr)
head = head_getter(token, head_attr)
if dep not in ignore_labels and token.orth_.strip():
if align.x2y.lengths[head.i] == 1:
- gold_head = align.x2y[head.i].dataXd[0, 0]
+ gold_head = align.x2y[head.i][0]
else:
gold_head = None
# None is indistinct, so we can't just add it to the set
@@ -750,7 +750,7 @@ def get_ner_prf(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
for pred_ent in eg.x.ents:
if pred_ent.label_ not in score_per_type:
score_per_type[pred_ent.label_] = PRFScore()
- indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel()
+ indices = align_x2y[pred_ent.start : pred_ent.end]
if len(indices):
g_span = eg.y[indices[0] : indices[-1] + 1]
# Check we aren't missing annotation on this span. If so,
diff --git a/spacy/strings.pyi b/spacy/strings.pyi
index 5b4147e12..b29389b9a 100644
--- a/spacy/strings.pyi
+++ b/spacy/strings.pyi
@@ -1,4 +1,4 @@
-from typing import Optional, Iterable, Iterator, Union, Any
+from typing import Optional, Iterable, Iterator, Union, Any, overload
from pathlib import Path
def get_string_id(key: Union[str, int]) -> int: ...
@@ -7,7 +7,10 @@ class StringStore:
def __init__(
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
) -> None: ...
- def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
+ @overload
+ def __getitem__(self, string_or_id: Union[bytes, str]) -> int: ...
+ @overload
+ def __getitem__(self, string_or_id: int) -> str: ...
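+    # i.e. looking up a string returns its hash, and looking up a hash
+    # returns the original string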
def as_int(self, key: Union[bytes, str, int]) -> int: ...
def as_string(self, key: Union[bytes, str, int]) -> str: ...
def add(self, string: str) -> int: ...
diff --git a/spacy/tests/conftest.py b/spacy/tests/conftest.py
index ee90a9f38..db17f1a8f 100644
--- a/spacy/tests/conftest.py
+++ b/spacy/tests/conftest.py
@@ -99,6 +99,11 @@ def de_vocab():
return get_lang_class("de")().vocab
+@pytest.fixture(scope="session")
+def dsb_tokenizer():
+ return get_lang_class("dsb")().tokenizer
+
+
@pytest.fixture(scope="session")
def el_tokenizer():
return get_lang_class("el")().tokenizer
@@ -221,12 +226,30 @@ def ja_tokenizer():
return get_lang_class("ja")().tokenizer
+@pytest.fixture(scope="session")
+def hsb_tokenizer():
+ return get_lang_class("hsb")().tokenizer
+
+
@pytest.fixture(scope="session")
def ko_tokenizer():
pytest.importorskip("natto")
return get_lang_class("ko")().tokenizer
+@pytest.fixture(scope="session")
+def ko_tokenizer_tokenizer():
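+    # uses the rule-based spacy.Tokenizer.v1 instead of the default Korean
+    # tokenizer (which requires the external natto-py package)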
+ config = {
+ "nlp": {
+ "tokenizer": {
+ "@tokenizers": "spacy.Tokenizer.v1",
+ }
+ }
+ }
+ nlp = get_lang_class("ko").from_config(config)
+ return nlp.tokenizer
+
+
@pytest.fixture(scope="session")
def lb_tokenizer():
return get_lang_class("lb")().tokenizer
@@ -334,6 +357,11 @@ def sv_tokenizer():
return get_lang_class("sv")().tokenizer
+@pytest.fixture(scope="session")
+def ta_tokenizer():
+ return get_lang_class("ta")().tokenizer
+
+
@pytest.fixture(scope="session")
def th_tokenizer():
pytest.importorskip("pythainlp")
diff --git a/spacy/tests/doc/test_doc_api.py b/spacy/tests/doc/test_doc_api.py
index 10700b787..19b554572 100644
--- a/spacy/tests/doc/test_doc_api.py
+++ b/spacy/tests/doc/test_doc_api.py
@@ -1,6 +1,7 @@
import weakref
import numpy
+from numpy.testing import assert_array_equal
import pytest
from thinc.api import NumpyOps, get_current_ops
@@ -634,6 +635,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
+ # can exclude spans
+ m_doc = Doc.from_docs(en_docs, exclude=["spans"])
+ assert "group" not in m_doc.spans
+
+ # can exclude user_data
+ m_doc = Doc.from_docs(en_docs, exclude=["user_data"])
+ assert m_doc.user_data == {}
+
# can merge empty docs
doc = Doc.from_docs([en_tokenizer("")] * 10)
@@ -647,6 +656,20 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans
assert len(m_doc.spans["group"]) == 0
+ # with tensor
+ ops = get_current_ops()
+ for doc in en_docs:
+ doc.tensor = ops.asarray([[len(t.text), 0.0] for t in doc])
+ m_doc = Doc.from_docs(en_docs)
+ assert_array_equal(
+ ops.to_numpy(m_doc.tensor),
+ ops.to_numpy(ops.xp.vstack([doc.tensor for doc in en_docs if len(doc)])),
+ )
+
+ # can exclude tensor
+ m_doc = Doc.from_docs(en_docs, exclude=["tensor"])
+ assert m_doc.tensor.shape == (0,)
+
def test_doc_api_from_docs_ents(en_tokenizer):
texts = ["Merging the docs is fun.", "They don't think alike."]
@@ -684,6 +707,7 @@ def test_has_annotation(en_vocab):
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
for attr in attrs:
assert not doc.has_annotation(attr)
+ assert not doc.has_annotation(attr, require_complete=True)
doc[0].tag_ = "A"
doc[0].pos_ = "X"
@@ -709,6 +733,27 @@ def test_has_annotation(en_vocab):
assert doc.has_annotation(attr, require_complete=True)
+def test_has_annotation_sents(en_vocab):
+ doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
+ attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
+ for attr in attrs:
+ assert not doc.has_annotation(attr)
+ assert not doc.has_annotation(attr, require_complete=True)
+
+ # The first token (index 0) is always assumed to be a sentence start,
+ # and ignored by the check in doc.has_annotation
+
+ doc[1].is_sent_start = False
+ for attr in attrs:
+ assert doc.has_annotation(attr)
+ assert not doc.has_annotation(attr, require_complete=True)
+
+ doc[2].is_sent_start = False
+ for attr in attrs:
+ assert doc.has_annotation(attr)
+ assert doc.has_annotation(attr, require_complete=True)
+
+
def test_is_flags_deprecated(en_tokenizer):
doc = en_tokenizer("test")
with pytest.deprecated_call():
diff --git a/spacy/tests/doc/test_span.py b/spacy/tests/doc/test_span.py
index bdf34c1c1..c0496cabf 100644
--- a/spacy/tests/doc/test_span.py
+++ b/spacy/tests/doc/test_span.py
@@ -655,3 +655,16 @@ def test_span_sents(doc, start, end, expected_sentences, expected_sentences_with
def test_span_sents_not_parsed(doc_not_parsed):
with pytest.raises(ValueError):
list(Span(doc_not_parsed, 0, 3).sents)
+
+
+def test_span_group_copy(doc):
+ doc.spans["test"] = [doc[0:1], doc[2:4]]
+ assert len(doc.spans["test"]) == 2
+ doc_copy = doc.copy()
+ # check that the spans were indeed copied
+ assert len(doc_copy.spans["test"]) == 2
+ # add a new span to the original doc
+ doc.spans["test"].append(doc[3:4])
+ assert len(doc.spans["test"]) == 3
+    # check that the spans of the copy were not modified, i.e. the copy is isolated from the original doc
+ assert len(doc_copy.spans["test"]) == 2
diff --git a/spacy/tests/doc/test_span_group.py b/spacy/tests/doc/test_span_group.py
new file mode 100644
index 000000000..8c70a83e1
--- /dev/null
+++ b/spacy/tests/doc/test_span_group.py
@@ -0,0 +1,242 @@
+import pytest
+from random import Random
+from spacy.matcher import Matcher
+from spacy.tokens import Span, SpanGroup
+
+
+@pytest.fixture
+def doc(en_tokenizer):
+ doc = en_tokenizer("0 1 2 3 4 5 6")
+ matcher = Matcher(en_tokenizer.vocab, validate=True)
+
+ # fmt: off
+ matcher.add("4", [[{}, {}, {}, {}]])
+ matcher.add("2", [[{}, {}, ]])
+ matcher.add("1", [[{}, ]])
+ # fmt: on
+ matches = matcher(doc)
+ spans = []
+ for match in matches:
+ spans.append(
+ Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
+ )
+ Random(42).shuffle(spans)
+ doc.spans["SPANS"] = SpanGroup(
+ doc, name="SPANS", attrs={"key": "value"}, spans=spans
+ )
+ return doc
+
+
+@pytest.fixture
+def other_doc(en_tokenizer):
+ doc = en_tokenizer("0 1 2 3 4 5 6")
+ matcher = Matcher(en_tokenizer.vocab, validate=True)
+
+ # fmt: off
+ matcher.add("4", [[{}, {}, {}, {}]])
+ matcher.add("2", [[{}, {}, ]])
+ matcher.add("1", [[{}, ]])
+ # fmt: on
+
+ matches = matcher(doc)
+ spans = []
+ for match in matches:
+ spans.append(
+ Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
+ )
+ Random(42).shuffle(spans)
+ doc.spans["SPANS"] = SpanGroup(
+ doc, name="SPANS", attrs={"key": "value"}, spans=spans
+ )
+ return doc
+
+
+@pytest.fixture
+def span_group(en_tokenizer):
+ doc = en_tokenizer("0 1 2 3 4 5 6")
+ matcher = Matcher(en_tokenizer.vocab, validate=True)
+
+ # fmt: off
+ matcher.add("4", [[{}, {}, {}, {}]])
+ matcher.add("2", [[{}, {}, ]])
+ matcher.add("1", [[{}, ]])
+ # fmt: on
+
+ matches = matcher(doc)
+ spans = []
+ for match in matches:
+ spans.append(
+ Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
+ )
+ Random(42).shuffle(spans)
+ doc.spans["SPANS"] = SpanGroup(
+ doc, name="SPANS", attrs={"key": "value"}, spans=spans
+ )
+
+
+def test_span_group_copy(doc):
+ span_group = doc.spans["SPANS"]
+ clone = span_group.copy()
+ assert clone != span_group
+ assert clone.name == span_group.name
+ assert clone.attrs == span_group.attrs
+ assert len(clone) == len(span_group)
+ assert list(span_group) == list(clone)
+ clone.name = "new_name"
+ clone.attrs["key"] = "new_value"
+ clone.append(Span(doc, 0, 6, "LABEL"))
+ assert clone.name != span_group.name
+ assert clone.attrs != span_group.attrs
+ assert span_group.attrs["key"] == "value"
+ assert list(span_group) != list(clone)
+
+
+def test_span_group_set_item(doc, other_doc):
+ span_group = doc.spans["SPANS"]
+
+ index = 5
+ span = span_group[index]
+ span.label_ = "NEW LABEL"
+ span.kb_id = doc.vocab.strings["KB_ID"]
+
+ assert span_group[index].label != span.label
+ assert span_group[index].kb_id != span.kb_id
+
+ span_group[index] = span
+ assert span_group[index].start == span.start
+ assert span_group[index].end == span.end
+ assert span_group[index].label == span.label
+ assert span_group[index].kb_id == span.kb_id
+ assert span_group[index] == span
+
+ with pytest.raises(IndexError):
+ span_group[-100] = span
+ with pytest.raises(IndexError):
+ span_group[100] = span
+
+ span = Span(other_doc, 0, 2)
+ with pytest.raises(ValueError):
+ span_group[index] = span
+
+
+def test_span_group_has_overlap(doc):
+ span_group = doc.spans["SPANS"]
+ assert span_group.has_overlap
+
+
+def test_span_group_concat(doc, other_doc):
+ span_group_1 = doc.spans["SPANS"]
+ spans = [doc[0:5], doc[0:6]]
+ span_group_2 = SpanGroup(
+ doc,
+ name="MORE_SPANS",
+ attrs={"key": "new_value", "new_key": "new_value"},
+ spans=spans,
+ )
+ span_group_3 = span_group_1._concat(span_group_2)
+ assert span_group_3.name == span_group_1.name
+ assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
+ span_list_expected = list(span_group_1) + list(span_group_2)
+ assert list(span_group_3) == list(span_list_expected)
+
+ # Inplace
+ span_list_expected = list(span_group_1) + list(span_group_2)
+ span_group_3 = span_group_1._concat(span_group_2, inplace=True)
+ assert span_group_3 == span_group_1
+ assert span_group_3.name == span_group_1.name
+ assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
+ assert list(span_group_3) == list(span_list_expected)
+
+ span_group_2 = other_doc.spans["SPANS"]
+ with pytest.raises(ValueError):
+ span_group_1._concat(span_group_2)
+
+
+def test_span_doc_delitem(doc):
+ span_group = doc.spans["SPANS"]
+ length = len(span_group)
+ index = 5
+ span = span_group[index]
+ next_span = span_group[index + 1]
+ del span_group[index]
+ assert len(span_group) == length - 1
+ assert span_group[index] != span
+ assert span_group[index] == next_span
+
+ with pytest.raises(IndexError):
+ del span_group[-100]
+ with pytest.raises(IndexError):
+ del span_group[100]
+
+
+def test_span_group_add(doc):
+ span_group_1 = doc.spans["SPANS"]
+ spans = [doc[0:5], doc[0:6]]
+ span_group_2 = SpanGroup(
+ doc,
+ name="MORE_SPANS",
+ attrs={"key": "new_value", "new_key": "new_value"},
+ spans=spans,
+ )
+
+ span_group_3_expected = span_group_1._concat(span_group_2)
+
+ span_group_3 = span_group_1 + span_group_2
+ assert len(span_group_3) == len(span_group_3_expected)
+ assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
+ assert list(span_group_3) == list(span_group_3_expected)
+
+
+def test_span_group_iadd(doc):
+ span_group_1 = doc.spans["SPANS"].copy()
+ spans = [doc[0:5], doc[0:6]]
+ span_group_2 = SpanGroup(
+ doc,
+ name="MORE_SPANS",
+ attrs={"key": "new_value", "new_key": "new_value"},
+ spans=spans,
+ )
+
+ span_group_1_expected = span_group_1._concat(span_group_2)
+
+ span_group_1 += span_group_2
+ assert len(span_group_1) == len(span_group_1_expected)
+ assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
+ assert list(span_group_1) == list(span_group_1_expected)
+
+ span_group_1 = doc.spans["SPANS"].copy()
+ span_group_1 += spans
+ assert len(span_group_1) == len(span_group_1_expected)
+ assert span_group_1.attrs == {
+ "key": "value",
+ }
+ assert list(span_group_1) == list(span_group_1_expected)
+
+
+def test_span_group_extend(doc):
+ span_group_1 = doc.spans["SPANS"].copy()
+ spans = [doc[0:5], doc[0:6]]
+ span_group_2 = SpanGroup(
+ doc,
+ name="MORE_SPANS",
+ attrs={"key": "new_value", "new_key": "new_value"},
+ spans=spans,
+ )
+
+ span_group_1_expected = span_group_1._concat(span_group_2)
+
+ span_group_1.extend(span_group_2)
+ assert len(span_group_1) == len(span_group_1_expected)
+ assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
+ assert list(span_group_1) == list(span_group_1_expected)
+
+ span_group_1 = doc.spans["SPANS"]
+ span_group_1.extend(spans)
+ assert len(span_group_1) == len(span_group_1_expected)
+ assert span_group_1.attrs == {"key": "value"}
+ assert list(span_group_1) == list(span_group_1_expected)
+
+
+def test_span_group_dealloc(span_group):
+ with pytest.raises(AttributeError):
+ print(span_group.doc)
diff --git a/spacy/tests/doc/test_to_json.py b/spacy/tests/doc/test_to_json.py
index 9ebee6c88..202281654 100644
--- a/spacy/tests/doc/test_to_json.py
+++ b/spacy/tests/doc/test_to_json.py
@@ -1,5 +1,5 @@
import pytest
-from spacy.tokens import Doc
+from spacy.tokens import Doc, Span
@pytest.fixture()
@@ -60,3 +60,13 @@ def test_doc_to_json_underscore_error_serialize(doc):
Doc.set_extension("json_test4", method=lambda doc: doc.text)
with pytest.raises(ValueError):
doc.to_json(underscore=["json_test4"])
+
+
+def test_doc_to_json_span(doc):
+ """Test that Doc.to_json() includes spans"""
+ doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")]
+ json_doc = doc.to_json()
+ assert "spans" in json_doc
+ assert len(json_doc["spans"]) == 1
+ assert len(json_doc["spans"]["test"]) == 2
+ assert json_doc["spans"]["test"][0]["start"] == 0
diff --git a/spacy/tests/lang/dsb/__init__.py b/spacy/tests/lang/dsb/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/spacy/tests/lang/dsb/test_text.py b/spacy/tests/lang/dsb/test_text.py
new file mode 100644
index 000000000..40f2c15e0
--- /dev/null
+++ b/spacy/tests/lang/dsb/test_text.py
@@ -0,0 +1,25 @@
+import pytest
+
+
+@pytest.mark.parametrize(
+ "text,match",
+ [
+ ("10", True),
+ ("1", True),
+ ("10,000", True),
+ ("10,00", True),
+ ("jadno", True),
+ ("dwanassฤo", True),
+ ("milion", True),
+ ("sto", True),
+ ("ceลa", False),
+ ("kopica", False),
+ ("narฤcow", False),
+ (",", False),
+ ("1/2", True),
+ ],
+)
+def test_lex_attrs_like_number(dsb_tokenizer, text, match):
+ tokens = dsb_tokenizer(text)
+ assert len(tokens) == 1
+ assert tokens[0].like_num == match
diff --git a/spacy/tests/lang/dsb/test_tokenizer.py b/spacy/tests/lang/dsb/test_tokenizer.py
new file mode 100644
index 000000000..135974fb8
--- /dev/null
+++ b/spacy/tests/lang/dsb/test_tokenizer.py
@@ -0,0 +1,29 @@
+import pytest
+
+DSB_BASIC_TOKENIZATION_TESTS = [
+ (
+ "Ale eksistฤrujo mimo togo ceลa kopica narฤcow, ako na pลikลad slฤpjaลska.",
+ [
+ "Ale",
+ "eksistฤrujo",
+ "mimo",
+ "togo",
+ "ceลa",
+ "kopica",
+ "narฤcow",
+ ",",
+ "ako",
+ "na",
+ "pลikลad",
+ "slฤpjaลska",
+ ".",
+ ],
+ ),
+]
+
+
+@pytest.mark.parametrize("text,expected_tokens", DSB_BASIC_TOKENIZATION_TESTS)
+def test_dsb_tokenizer_basic(dsb_tokenizer, text, expected_tokens):
+ tokens = dsb_tokenizer(text)
+ token_list = [token.text for token in tokens if not token.is_space]
+ assert expected_tokens == token_list
diff --git a/spacy/tests/lang/fi/test_noun_chunks.py b/spacy/tests/lang/fi/test_noun_chunks.py
index cc3b5aa36..cab84b311 100644
--- a/spacy/tests/lang/fi/test_noun_chunks.py
+++ b/spacy/tests/lang/fi/test_noun_chunks.py
@@ -107,7 +107,17 @@ FI_NP_TEST_EXAMPLES = [
(
"New York tunnetaan kaupunkina, joka ei koskaan nuku",
["PROPN", "PROPN", "VERB", "NOUN", "PUNCT", "PRON", "AUX", "ADV", "VERB"],
- ["obj", "flat:name", "ROOT", "obl", "punct", "nsubj", "aux", "advmod", "acl:relcl"],
+ [
+ "obj",
+ "flat:name",
+ "ROOT",
+ "obl",
+ "punct",
+ "nsubj",
+ "aux",
+ "advmod",
+ "acl:relcl",
+ ],
[2, -1, 0, -1, 4, 3, 2, 1, -5],
["New York", "kaupunkina"],
),
@@ -130,7 +140,12 @@ FI_NP_TEST_EXAMPLES = [
["NOUN", "VERB", "NOUN", "NOUN", "ADJ", "NOUN"],
["nsubj", "ROOT", "obj", "obl", "amod", "obl"],
[1, 0, -1, -1, 1, -3],
- ["sairaanhoitopiirit", "leikkaustoimintaa", "alueellaan", "useammassa sairaalassa"],
+ [
+ "sairaanhoitopiirit",
+ "leikkaustoimintaa",
+ "alueellaan",
+ "useammassa sairaalassa",
+ ],
),
(
"Lain mukaan varhaiskasvatus on suunnitelmallista toimintaa",
diff --git a/spacy/tests/lang/hsb/__init__.py b/spacy/tests/lang/hsb/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/spacy/tests/lang/hsb/test_text.py b/spacy/tests/lang/hsb/test_text.py
new file mode 100644
index 000000000..aaa4984eb
--- /dev/null
+++ b/spacy/tests/lang/hsb/test_text.py
@@ -0,0 +1,25 @@
+import pytest
+
+
+@pytest.mark.parametrize(
+ "text,match",
+ [
+ ("10", True),
+ ("1", True),
+ ("10,000", True),
+ ("10,00", True),
+ ("jedne", True),
+ ("dwanaฤe", True),
+ ("milion", True),
+ ("sto", True),
+ ("zaลoลพene", False),
+ ("wona", False),
+ ("powลกitkownej", False),
+ (",", False),
+ ("1/2", True),
+ ],
+)
+def test_lex_attrs_like_number(hsb_tokenizer, text, match):
+ tokens = hsb_tokenizer(text)
+ assert len(tokens) == 1
+ assert tokens[0].like_num == match
diff --git a/spacy/tests/lang/hsb/test_tokenizer.py b/spacy/tests/lang/hsb/test_tokenizer.py
new file mode 100644
index 000000000..a3ec89ba0
--- /dev/null
+++ b/spacy/tests/lang/hsb/test_tokenizer.py
@@ -0,0 +1,32 @@
+import pytest
+
+HSB_BASIC_TOKENIZATION_TESTS = [
+ (
+ "Hornjoserbลกฤina wobsteji resp. wobstejeลกe z wjacorych dialektow, kotreลพ so zdลบฤla chฤtro wot so rozeznawachu.",
+ [
+ "Hornjoserbลกฤina",
+ "wobsteji",
+ "resp.",
+ "wobstejeลกe",
+ "z",
+ "wjacorych",
+ "dialektow",
+ ",",
+ "kotreลพ",
+ "so",
+ "zdลบฤla",
+ "chฤtro",
+ "wot",
+ "so",
+ "rozeznawachu",
+ ".",
+ ],
+ ),
+]
+
+
+@pytest.mark.parametrize("text,expected_tokens", HSB_BASIC_TOKENIZATION_TESTS)
+def test_hsb_tokenizer_basic(hsb_tokenizer, text, expected_tokens):
+ tokens = hsb_tokenizer(text)
+ token_list = [token.text for token in tokens if not token.is_space]
+ assert expected_tokens == token_list
diff --git a/spacy/tests/lang/ko/test_tokenizer.py b/spacy/tests/lang/ko/test_tokenizer.py
index eac309857..6e06e405e 100644
--- a/spacy/tests/lang/ko/test_tokenizer.py
+++ b/spacy/tests/lang/ko/test_tokenizer.py
@@ -47,3 +47,29 @@ def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
def test_ko_empty_doc(ko_tokenizer):
tokens = ko_tokenizer("")
assert len(tokens) == 0
+
+
+@pytest.mark.issue(10535)
+def test_ko_tokenizer_unknown_tag(ko_tokenizer):
+    tokens = ko_tokenizer("미닛 리피터")
+ assert tokens[1].pos_ == "X"
+
+
+# fmt: off
+SPACY_TOKENIZER_TESTS = [
+ ("์๋ค.", "์๋ค ."),
+ ("'์'๋", "' ์ ' ๋"),
+ ("๋ถ (ๅฏ) ๋", "๋ถ ( ๅฏ ) ๋"),
+ ("๋ถ(ๅฏ)๋", "๋ถ ( ๅฏ ) ๋"),
+ ("1982~1983.", "1982 ~ 1983 ."),
+ ("์ฌ๊ณผยท๋ฐฐยท๋ณต์ญ์ยท์๋ฐ์ ๋ชจ๋ ๊ณผ์ผ์ด๋ค.", "์ฌ๊ณผ ยท ๋ฐฐ ยท ๋ณต์ญ์ ยท ์๋ฐ์ ๋ชจ๋ ๊ณผ์ผ์ด๋ค ."),
+ ("๊ทธ๋ ๊ตฌ๋~", "๊ทธ๋ ๊ตฌ๋~"),
+ ("ใ9์ ๋ฐ์ ๋น๊ตฌใ,", "ใ 9์ ๋ฐ์ ๋น๊ตฌ ใ ,"),
+]
+# fmt: on
+
+
+@pytest.mark.parametrize("text,expected_tokens", SPACY_TOKENIZER_TESTS)
+def test_ko_spacy_tokenizer(ko_tokenizer_tokenizer, text, expected_tokens):
+ tokens = [token.text for token in ko_tokenizer_tokenizer(text)]
+ assert tokens == expected_tokens.split()
diff --git a/spacy/tests/lang/ta/__init__.py b/spacy/tests/lang/ta/__init__.py
new file mode 100644
index 000000000..e69de29bb
diff --git a/spacy/tests/lang/ta/test_text.py b/spacy/tests/lang/ta/test_text.py
new file mode 100644
index 000000000..228a14c18
--- /dev/null
+++ b/spacy/tests/lang/ta/test_text.py
@@ -0,0 +1,25 @@
+import pytest
+from spacy.lang.ta import Tamil
+
+# Wikipedia excerpt: https://en.wikipedia.org/wiki/Chennai (Tamil Language)
+TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT = """เฎเฏเฎฉเฏเฎฉเฏ (Chennai) เฎคเฎฎเฎฟเฎดเฏเฎจเฎพเฎเฏเฎเฎฟเฎฉเฏ เฎคเฎฒเฏเฎจเฎเฎฐเฎฎเฏเฎฎเฏ, เฎเฎจเฏเฎคเฎฟเฎฏเฎพเฎตเฎฟเฎฉเฏ เฎจเฎพเฎฉเฏเฎเฎพเฎตเฎคเฏ เฎชเฏเฎฐเฎฟเฎฏ เฎจเฎเฎฐเฎฎเฏเฎฎเฏ เฎเฎเฏเฎฎเฏ. 1996 เฎเฎฎเฏ เฎเฎฃเฏเฎเฏเฎเฏเฎเฏ เฎฎเฏเฎฉเฏเฎฉเฎฐเฏ เฎเฎจเฏเฎจเฎเฎฐเฎฎเฏ, เฎฎเฎคเฎฐเฎพเฎเฏ เฎชเฎเฏเฎเฎฟเฎฉเฎฎเฏ, เฎฎเฏเฎเฏเฎฐเฎพเฎธเฏ (Madras) เฎฎเฎฑเฏเฎฑเฏเฎฎเฏ เฎเฏเฎฉเฏเฎฉเฎชเฏเฎชเฎเฏเฎเฎฟเฎฉเฎฎเฏ เฎเฎฉเฏเฎฑเฏเฎฎเฏ เฎ เฎดเฏเฎเฏเฎเฎชเฏเฎชเฎเฏเฎเฏ เฎตเฎจเฏเฎคเฎคเฏ. เฎเฏเฎฉเฏเฎฉเฏ, เฎตเฎเฏเฎเฎพเฎณ เฎตเฎฟเฎฐเฎฟเฎเฏเฎเฎพเฎตเฎฟเฎฉเฏ เฎเฎฐเฏเฎฏเฎฟเฎฒเฏ เฎ เฎฎเฏเฎจเฏเฎค เฎคเฏเฎฑเฏเฎฎเฏเฎ เฎจเฎเฎฐเฎเฏเฎเฎณเฏเฎณเฏ เฎเฎฉเฏเฎฑเฏ. เฎเฏเฎฎเฎพเฎฐเฏ 10 เฎฎเฎฟเฎฒเฏเฎฒเฎฟเฎฏเฎฉเฏ (เฎเฎฐเฏ เฎเฏเฎเฎฟ) เฎฎเฎเฏเฎเฎณเฏ เฎตเฎพเฎดเฏเฎฎเฏ เฎเฎจเฏเฎจเฎเฎฐเฎฎเฏ, เฎเฎฒเฎเฎฟเฎฉเฏ 35 เฎชเฏเฎฐเฎฟเฎฏ เฎฎเฎพเฎจเฎเฎฐเฎเฏเฎเฎณเฏเฎณเฏ เฎเฎฉเฏเฎฑเฏ. 17เฎเฎฎเฏ เฎจเฏเฎฑเฏเฎฑเฎพเฎฃเฏเฎเฎฟเฎฒเฏ เฎเฎเฏเฎเฎฟเฎฒเฏเฎฏเฎฐเฏ เฎเฏเฎฉเฏเฎฉเฏเฎฏเฎฟเฎฒเฏ เฎเฎพเฎฒเฏ เฎชเฎคเฎฟเฎคเฏเฎคเฎคเฏ เฎฎเฏเฎคเฎฒเฏ, เฎเฏเฎฉเฏเฎฉเฏ เฎจเฎเฎฐเฎฎเฏ เฎเฎฐเฏ เฎฎเฏเฎเฏเฎเฎฟเฎฏ เฎจเฎเฎฐเฎฎเฎพเฎ เฎตเฎณเฎฐเฏเฎจเฏเฎคเฏ เฎตเฎจเฏเฎคเฎฟเฎฐเฏเฎเฏเฎเฎฟเฎฑเฎคเฏ. เฎเฏเฎฉเฏเฎฉเฏ เฎคเฏเฎฉเฏเฎฉเฎฟเฎจเฏเฎคเฎฟเฎฏเฎพเฎตเฎฟเฎฉเฏ เฎตเฎพเฎเฎฒเฎพเฎเฎเฏ เฎเฎฐเฏเฎคเฎชเฏเฎชเฎเฏเฎเฎฟเฎฑเฎคเฏ. เฎเฏเฎฉเฏเฎฉเฏ เฎจเฎเฎฐเฎฟเฎฒเฏ เฎเฎณเฏเฎณ เฎฎเฏเฎฐเฎฟเฎฉเฎพ เฎเฎเฎฑเฏเฎเฎฐเฏ เฎเฎฒเฎเฎฟเฎฉเฏ เฎจเฏเฎณเฎฎเฎพเฎฉ เฎเฎเฎฑเฏเฎเฎฐเฏเฎเฎณเฏเฎณเฏ เฎเฎฉเฏเฎฑเฏ. เฎเฏเฎฉเฏเฎฉเฏ เฎเฏเฎฒเฎฟเฎตเฏเฎเฏ (Kollywood) เฎเฎฉ เฎ เฎฑเฎฟเฎฏเฎชเฏเฎชเฎเฏเฎฎเฏ เฎคเฎฎเฎฟเฎดเฏเฎคเฏ เฎคเฎฟเฎฐเฏเฎชเฏเฎชเฎเฎคเฏ เฎคเฏเฎฑเฏเฎฏเฎฟเฎฉเฏ เฎคเฎพเฎฏเฎเฎฎเฏ เฎเฎเฏเฎฎเฏ. เฎชเฎฒ เฎตเฎฟเฎณเฏเฎฏเฎพเฎเฏเฎเฏ เฎ เฎฐเฎเฏเฎเฎเฏเฎเฎณเฏ เฎเฎณเฏเฎณ เฎเฏเฎฉเฏเฎฉเฏเฎฏเฎฟเฎฒเฏ เฎชเฎฒ เฎตเฎฟเฎณเฏเฎฏเฎพเฎเฏเฎเฏเฎชเฏ เฎชเฏเฎเฏเฎเฎฟเฎเฎณเฏเฎฎเฏ เฎจเฎเฏเฎชเฏเฎฑเฏเฎเฎฟเฎฉเฏเฎฑเฎฉ."""
+
+
+@pytest.mark.parametrize(
+ "text, num_tokens",
+ [(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 23 + 90)], # Punctuation + rest
+)
+def test_long_text(ta_tokenizer, text, num_tokens):
+ tokens = ta_tokenizer(text)
+ assert len(tokens) == num_tokens
+
+
+@pytest.mark.parametrize(
+ "text, num_sents", [(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 9)]
+)
+def test_ta_sentencizer(text, num_sents):
+ nlp = Tamil()
+ nlp.add_pipe("sentencizer")
+
+ doc = nlp(text)
+ assert len(list(doc.sents)) == num_sents
diff --git a/spacy/tests/lang/ta/test_tokenizer.py b/spacy/tests/lang/ta/test_tokenizer.py
new file mode 100644
index 000000000..6ba8a2400
--- /dev/null
+++ b/spacy/tests/lang/ta/test_tokenizer.py
@@ -0,0 +1,188 @@
+import pytest
+from spacy.symbols import ORTH
+from spacy.lang.ta import Tamil
+
+TA_BASIC_TOKENIZATION_TESTS = [
+ (
+ "เฎเฎฟเฎฑเฎฟเฎธเฏเฎคเฏเฎฎเฎธเฏ เฎฎเฎฑเฏเฎฑเฏเฎฎเฏ เฎเฎฉเฎฟเฎฏ เฎชเฏเฎคเฏเฎคเฎพเฎฃเฏเฎเฏ เฎตเฎพเฎดเฏเฎคเฏเฎคเฏเฎเฏเฎเฎณเฏ",
+ ["เฎเฎฟเฎฑเฎฟเฎธเฏเฎคเฏเฎฎเฎธเฏ", "เฎฎเฎฑเฏเฎฑเฏเฎฎเฏ", "เฎเฎฉเฎฟเฎฏ", "เฎชเฏเฎคเฏเฎคเฎพเฎฃเฏเฎเฏ", "เฎตเฎพเฎดเฏเฎคเฏเฎคเฏเฎเฏเฎเฎณเฏ"],
+ ),
+ (
+ "เฎเฎฉเฎเฏเฎเฏ เฎเฎฉเฏ เฎเฏเฎดเฎจเฏเฎคเฏเฎชเฏ เฎชเฎฐเฏเฎตเฎฎเฏ เฎจเฎฟเฎฉเฏเฎตเฎฟเฎฐเฏเฎเฏเฎเฎฟเฎฑเฎคเฏ",
+ ["เฎเฎฉเฎเฏเฎเฏ", "เฎเฎฉเฏ", "เฎเฏเฎดเฎจเฏเฎคเฏเฎชเฏ", "เฎชเฎฐเฏเฎตเฎฎเฏ", "เฎจเฎฟเฎฉเฏเฎตเฎฟเฎฐเฏเฎเฏเฎเฎฟเฎฑเฎคเฏ"],
+ ),
+ ("เฎเฎเฏเฎเฎณเฏ เฎชเฏเฎฏเฎฐเฏ เฎเฎฉเฏเฎฉ?", ["เฎเฎเฏเฎเฎณเฏ", "เฎชเฏเฎฏเฎฐเฏ", "เฎเฎฉเฏเฎฉ", "?"]),
+ (
+ "เฎเฎฑเฎคเฏเฎคเฎพเฎด เฎเฎฒเฎเฏเฎเฏเฎคเฏ เฎคเฎฎเฎฟเฎดเฎฐเฎฟเฎฒเฏ เฎฎเฏเฎฉเฏเฎฑเฎฟเฎฒเฏเฎฐเฏ เฎชเฎเฏเฎเฎฟเฎฉเฎฐเฏ เฎเฎฒเฎเฏเฎเฏเฎฏเฏ เฎตเฎฟเฎเฏเฎเฏ เฎตเฏเฎณเฎฟเฎฏเฏเฎฑเฎฟเฎชเฏ เฎชเฎฟเฎฑ เฎจเฎพเฎเฏเฎเฎณเฎฟเฎฒเฏ เฎตเฎพเฎดเฏเฎเฎฟเฎฉเฏเฎฑเฎฉเฎฐเฏ",
+ [
+ "เฎเฎฑเฎคเฏเฎคเฎพเฎด",
+ "เฎเฎฒเฎเฏเฎเฏเฎคเฏ",
+ "เฎคเฎฎเฎฟเฎดเฎฐเฎฟเฎฒเฏ",
+ "เฎฎเฏเฎฉเฏเฎฑเฎฟเฎฒเฏเฎฐเฏ",
+ "เฎชเฎเฏเฎเฎฟเฎฉเฎฐเฏ",
+ "เฎเฎฒเฎเฏเฎเฏเฎฏเฏ",
+ "เฎตเฎฟเฎเฏเฎเฏ",
+ "เฎตเฏเฎณเฎฟเฎฏเฏเฎฑเฎฟเฎชเฏ",
+ "เฎชเฎฟเฎฑ",
+ "เฎจเฎพเฎเฏเฎเฎณเฎฟเฎฒเฏ",
+ "เฎตเฎพเฎดเฏเฎเฎฟเฎฉเฏเฎฑเฎฉเฎฐเฏ",
+ ],
+ ),
+ (
+ "เฎเฎจเฏเฎค เฎเฎชเฏเฎฉเฏเฎเฎฉเฏ เฎเฏเฎฎเฎพเฎฐเฏ เฎฐเฏ.2,990 เฎฎเฎคเฎฟเฎชเฏเฎชเฏเฎณเฏเฎณ เฎชเฏเฎเฏ เฎฐเฎพเฎเฏเฎเฎฐเฏเฎธเฏ เฎจเฎฟเฎฑเฏเฎตเฎฉเฎคเฏเฎคเฎฟเฎฉเฏ เฎธเฏเฎชเฏเฎฐเฏเฎเฏ เฎชเฏเฎณเฏเฎเฏเฎคเฏ เฎนเฏเฎเฏเฎชเฏเฎฉเฏเฎธเฏ เฎเฎฒเฎตเฎเฎฎเฎพเฎ เฎตเฎดเฎเฏเฎเฎชเฏเฎชเฎเฎตเฏเฎณเฏเฎณเฎคเฏ.",
+ [
+ "เฎเฎจเฏเฎค",
+ "เฎเฎชเฏเฎฉเฏเฎเฎฉเฏ",
+ "เฎเฏเฎฎเฎพเฎฐเฏ",
+ "เฎฐเฏ.2,990",
+ "เฎฎเฎคเฎฟเฎชเฏเฎชเฏเฎณเฏเฎณ",
+ "เฎชเฏเฎเฏ",
+ "เฎฐเฎพเฎเฏเฎเฎฐเฏเฎธเฏ",
+ "เฎจเฎฟเฎฑเฏเฎตเฎฉเฎคเฏเฎคเฎฟเฎฉเฏ",
+ "เฎธเฏเฎชเฏเฎฐเฏเฎเฏ",
+ "เฎชเฏเฎณเฏเฎเฏเฎคเฏ",
+ "เฎนเฏเฎเฏเฎชเฏเฎฉเฏเฎธเฏ",
+ "เฎเฎฒเฎตเฎเฎฎเฎพเฎ",
+ "เฎตเฎดเฎเฏเฎเฎชเฏเฎชเฎเฎตเฏเฎณเฏเฎณเฎคเฏ",
+ ".",
+ ],
+ ),
+ (
+ "เฎฎเฎเฏเฎเฎเฏเฎเฎณเฎชเฏเฎชเฎฟเฎฒเฏ เฎชเฎฒ เฎเฎเฎเฏเฎเฎณเฎฟเฎฒเฏ เฎตเฏเฎเฏเฎเฏเฎคเฏ เฎคเฎฟเฎเฏเฎเฎเฏเฎเฎณเฏเฎเฏเฎเฏ เฎเฎฉเฏเฎฑเฏ เฎ เฎเฎฟเฎเฏเฎเฎฒเฏ เฎจเฎพเฎเฏเฎเฎฒเฏ",
+ [
+ "เฎฎเฎเฏเฎเฎเฏเฎเฎณเฎชเฏเฎชเฎฟเฎฒเฏ",
+ "เฎชเฎฒ",
+ "เฎเฎเฎเฏเฎเฎณเฎฟเฎฒเฏ",
+ "เฎตเฏเฎเฏเฎเฏเฎคเฏ",
+ "เฎคเฎฟเฎเฏเฎเฎเฏเฎเฎณเฏเฎเฏเฎเฏ",
+ "เฎเฎฉเฏเฎฑเฏ",
+ "เฎ เฎเฎฟเฎเฏเฎเฎฒเฏ",
+ "เฎจเฎพเฎเฏเฎเฎฒเฏ",
+ ],
+ ),
+ (
+ "เฎ เฎชเฏเฎฉเฏเฎเฏเฎเฏ เฎฎเฏเฎเฎคเฏเฎคเฏ เฎตเฏเฎคเฏเฎคเฏ เฎ เฎฉเฏเฎฒเฎพเฎเฏ เฎเฏเฎฏเฏเฎฏเฏเฎฎเฏ เฎฎเฏเฎฑเฏ เฎฎเฎฑเฏเฎฑเฏเฎฎเฏ เฎตเฎฟเฎฐเฎฒเฎพเฎฒเฏ เฎคเฏเฎเฏเฎเฏ เฎ เฎฉเฏเฎฒเฎพเฎเฏ เฎเฏเฎฏเฏเฎฏเฏเฎฎเฏ เฎฎเฏเฎฑเฏเฎฏเฏ เฎตเฎพเฎเฏเฎธเฏ เฎเฎชเฏ เฎจเฎฟเฎฑเฏเฎตเฎฉเฎฎเฏ เฎเฎคเฎฑเฏเฎเฏ เฎฎเฏเฎฉเฏ เฎเฎฃเฏเฎเฏเฎชเฎฟเฎเฎฟเฎคเฏเฎคเฎคเฏ",
+ [
+ "เฎ",
+ "เฎชเฏเฎฉเฏเฎเฏเฎเฏ",
+ "เฎฎเฏเฎเฎคเฏเฎคเฏ",
+ "เฎตเฏเฎคเฏเฎคเฏ",
+ "เฎ เฎฉเฏเฎฒเฎพเฎเฏ",
+ "เฎเฏเฎฏเฏเฎฏเฏเฎฎเฏ",
+ "เฎฎเฏเฎฑเฏ",
+ "เฎฎเฎฑเฏเฎฑเฏเฎฎเฏ",
+ "เฎตเฎฟเฎฐเฎฒเฎพเฎฒเฏ",
+ "เฎคเฏเฎเฏเฎเฏ",
+ "เฎ เฎฉเฏเฎฒเฎพเฎเฏ",
+ "เฎเฏเฎฏเฏเฎฏเฏเฎฎเฏ",
+ "เฎฎเฏเฎฑเฏเฎฏเฏ",
+ "เฎตเฎพเฎเฏเฎธเฏ",
+ "เฎเฎชเฏ",
+ "เฎจเฎฟเฎฑเฏเฎตเฎฉเฎฎเฏ",
+ "เฎเฎคเฎฑเฏเฎเฏ",
+ "เฎฎเฏเฎฉเฏ",
+ "เฎเฎฃเฏเฎเฏเฎชเฎฟเฎเฎฟเฎคเฏเฎคเฎคเฏ",
+ ],
+ ),
+ (
+ "เฎเฎคเฏ เฎเฎฐเฏ เฎตเฎพเฎเฏเฎเฎฟเฎฏเฎฎเฏ.",
+ [
+ "เฎเฎคเฏ",
+ "เฎเฎฐเฏ",
+ "เฎตเฎพเฎเฏเฎเฎฟเฎฏเฎฎเฏ",
+ ".",
+ ],
+ ),
+ (
+ "เฎคเฎฉเฏเฎฉเฎพเฎเฏเฎเฎฟ เฎเฎพเฎฐเฏเฎเฎณเฏ เฎเฎพเฎชเฏเฎชเฏเฎเฏเฎเฏ เฎชเฏเฎฑเฏเฎชเฏเฎชเฏ เฎเฎฑเฏเฎชเฎคเฏเฎคเฎฟเฎฏเฎพเฎณเฎฐเฎฟเฎเฎฎเฏ เฎฎเฎพเฎฑเฏเฎฑเฏเฎเฎฟเฎฉเฏเฎฑเฎฉ",
+ [
+ "เฎคเฎฉเฏเฎฉเฎพเฎเฏเฎเฎฟ",
+ "เฎเฎพเฎฐเฏเฎเฎณเฏ",
+ "เฎเฎพเฎชเฏเฎชเฏเฎเฏเฎเฏ",
+ "เฎชเฏเฎฑเฏเฎชเฏเฎชเฏ",
+ "เฎเฎฑเฏเฎชเฎคเฏเฎคเฎฟเฎฏเฎพเฎณเฎฐเฎฟเฎเฎฎเฏ",
+ "เฎฎเฎพเฎฑเฏเฎฑเฏเฎเฎฟเฎฉเฏเฎฑเฎฉ",
+ ],
+ ),
+ (
+ "เฎจเฎเฏเฎชเฎพเฎคเฏ เฎตเฎฟเฎจเฎฟเฎฏเฏเฎ เฎฐเฏเฎชเฏเฎเฏเฎเฎณเฏ เฎคเฎเฏ เฎเฏเฎฏเฏเฎตเฎคเฏ เฎเฎพเฎฉเฏ เฎชเฎฟเฎฐเฎพเฎฉเฏเฎเฎฟเฎธเฏเฎเฏ เฎเฎฐเฏเฎคเฏเฎเฎฟเฎฑเฎคเฏ",
+ [
+ "เฎจเฎเฏเฎชเฎพเฎคเฏ",
+ "เฎตเฎฟเฎจเฎฟเฎฏเฏเฎ",
+ "เฎฐเฏเฎชเฏเฎเฏเฎเฎณเฏ",
+ "เฎคเฎเฏ",
+ "เฎเฏเฎฏเฏเฎตเฎคเฏ",
+ "เฎเฎพเฎฉเฏ",
+ "เฎชเฎฟเฎฐเฎพเฎฉเฏเฎเฎฟเฎธเฏเฎเฏ",
+ "เฎเฎฐเฏเฎคเฏเฎเฎฟเฎฑเฎคเฏ",
+ ],
+ ),
+ (
+ "เฎฒเฎฃเฏเฎเฎฉเฏ เฎเฎเฏเฎเฎฟเฎฏ เฎเฎฐเฎพเฎเฏเฎเฎฟเฎฏเฎคเฏเฎคเฎฟเฎฒเฏ เฎเฎฐเฏ เฎชเฏเฎฐเฎฟเฎฏ เฎจเฎเฎฐเฎฎเฏ.",
+ [
+ "เฎฒเฎฃเฏเฎเฎฉเฏ",
+ "เฎเฎเฏเฎเฎฟเฎฏ",
+ "เฎเฎฐเฎพเฎเฏเฎเฎฟเฎฏเฎคเฏเฎคเฎฟเฎฒเฏ",
+ "เฎเฎฐเฏ",
+ "เฎชเฏเฎฐเฎฟเฎฏ",
+ "เฎจเฎเฎฐเฎฎเฏ",
+ ".",
+ ],
+ ),
+ (
+ "เฎเฎฉเฏเฎฉ เฎตเฏเฎฒเฏ เฎเฏเฎฏเฏเฎเฎฟเฎฑเฏเฎฐเฏเฎเฎณเฏ?",
+ [
+ "เฎเฎฉเฏเฎฉ",
+ "เฎตเฏเฎฒเฏ",
+ "เฎเฏเฎฏเฏเฎเฎฟเฎฑเฏเฎฐเฏเฎเฎณเฏ",
+ "?",
+ ],
+ ),
+ (
+ "เฎเฎจเฏเฎค เฎเฎฒเฏเฎฒเฏเฎฐเฎฟเฎฏเฎฟเฎฒเฏ เฎชเฎเฎฟเฎเฏเฎเฎฟเฎฑเฎพเฎฏเฏ?",
+ [
+ "เฎเฎจเฏเฎค",
+ "เฎเฎฒเฏเฎฒเฏเฎฐเฎฟเฎฏเฎฟเฎฒเฏ",
+ "เฎชเฎเฎฟเฎเฏเฎเฎฟเฎฑเฎพเฎฏเฏ",
+ "?",
+ ],
+ ),
+]
+
+
+@pytest.mark.parametrize("text,expected_tokens", TA_BASIC_TOKENIZATION_TESTS)
+def test_ta_tokenizer_basic(ta_tokenizer, text, expected_tokens):
+ tokens = ta_tokenizer(text)
+ token_list = [token.text for token in tokens]
+ assert expected_tokens == token_list
+
+
+@pytest.mark.parametrize(
+ "text,expected_tokens",
+ [
+ (
+ "เฎเฎชเฏเฎชเฎฟเฎณเฏ เฎจเฎฟเฎฑเฏเฎตเฎฉเฎฎเฏ เฎฏเฏ.เฎเฏ. เฎคเฏเฎเฎเฏเฎ เฎจเฎฟเฎฑเฏเฎตเฎฉเฎคเฏเฎคเฏ เฎเฎฐเฏ เฎฒเฎเฏเฎเฎฎเฏ เฎเฏเฎเฎฟเฎเฏเฎเฏ เฎตเฎพเฎเฏเฎเฎชเฏ เฎชเฎพเฎฐเฏเฎเฏเฎเฎฟเฎฑเฎคเฏ",
+ [
+ "เฎเฎชเฏเฎชเฎฟเฎณเฏ",
+ "เฎจเฎฟเฎฑเฏเฎตเฎฉเฎฎเฏ",
+ "เฎฏเฏ.เฎเฏ.",
+ "เฎคเฏเฎเฎเฏเฎ",
+ "เฎจเฎฟเฎฑเฏเฎตเฎฉเฎคเฏเฎคเฏ",
+ "เฎเฎฐเฏ",
+ "เฎฒเฎเฏเฎเฎฎเฏ",
+ "เฎเฏเฎเฎฟเฎเฏเฎเฏ",
+ "เฎตเฎพเฎเฏเฎเฎชเฏ",
+ "เฎชเฎพเฎฐเฏเฎเฏเฎเฎฟเฎฑเฎคเฏ",
+ ],
+ )
+ ],
+)
+def test_ta_tokenizer_special_case(text, expected_tokens):
+ # Add a special rule to tokenize the initialism "เฎฏเฏ.เฎเฏ." (U.K., as
+ # in the country) as a single token.
+ nlp = Tamil()
+ nlp.tokenizer.add_special_case("เฎฏเฏ.เฎเฏ.", [{ORTH: "เฎฏเฏ.เฎเฏ."}])
+ tokens = nlp(text)
+
+ token_list = [token.text for token in tokens]
+ assert expected_tokens == token_list
diff --git a/spacy/tests/lang/tr/test_text.py b/spacy/tests/lang/tr/test_text.py
index a12971e82..323b11bd1 100644
--- a/spacy/tests/lang/tr/test_text.py
+++ b/spacy/tests/lang/tr/test_text.py
@@ -41,7 +41,7 @@ def test_tr_lex_attrs_like_number_cardinal_ordinal(word):
assert like_num(word)
-@pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci"])
+@pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci", "milyonuncu"])
def test_tr_lex_attrs_capitals(word):
assert like_num(word)
assert like_num(word.upper())
diff --git a/spacy/tests/lang/tr/test_tokenizer.py b/spacy/tests/lang/tr/test_tokenizer.py
index 2ceca5068..9f988eae9 100644
--- a/spacy/tests/lang/tr/test_tokenizer.py
+++ b/spacy/tests/lang/tr/test_tokenizer.py
@@ -694,5 +694,4 @@ TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
tokens = tr_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space]
- print(token_list)
assert expected_tokens == token_list
diff --git a/spacy/tests/package/test_requirements.py b/spacy/tests/package/test_requirements.py
index 75908df59..e20227455 100644
--- a/spacy/tests/package/test_requirements.py
+++ b/spacy/tests/package/test_requirements.py
@@ -12,6 +12,7 @@ def test_build_dependencies():
"flake8",
"hypothesis",
"pre-commit",
+ "black",
"mypy",
"types-dataclasses",
"types-mock",
diff --git a/spacy/tests/parser/test_nonproj.py b/spacy/tests/parser/test_nonproj.py
index 3957e4d77..60d000c44 100644
--- a/spacy/tests/parser/test_nonproj.py
+++ b/spacy/tests/parser/test_nonproj.py
@@ -93,8 +93,8 @@ def test_parser_pseudoprojectivity(en_vocab):
assert nonproj.is_decorated("X") is False
nonproj._lift(0, tree)
assert tree == [2, 2, 2]
- assert nonproj._get_smallest_nonproj_arc(nonproj_tree) == 7
- assert nonproj._get_smallest_nonproj_arc(nonproj_tree2) == 10
+ assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree) == 7
+ assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree2) == 10
# fmt: off
proj_heads, deco_labels = nonproj.projectivize(nonproj_tree, labels)
assert proj_heads == [1, 2, 2, 4, 5, 2, 7, 5, 2]
diff --git a/spacy/tests/pipeline/test_edit_tree_lemmatizer.py b/spacy/tests/pipeline/test_edit_tree_lemmatizer.py
new file mode 100644
index 000000000..cf541e301
--- /dev/null
+++ b/spacy/tests/pipeline/test_edit_tree_lemmatizer.py
@@ -0,0 +1,280 @@
+import pickle
+import pytest
+from hypothesis import given
+import hypothesis.strategies as st
+from spacy import util
+from spacy.lang.en import English
+from spacy.language import Language
+from spacy.pipeline._edit_tree_internals.edit_trees import EditTrees
+from spacy.training import Example
+from spacy.strings import StringStore
+from spacy.util import make_tempdir
+
+
+TRAIN_DATA = [
+ ("She likes green eggs", {"lemmas": ["she", "like", "green", "egg"]}),
+ ("Eat blue ham", {"lemmas": ["eat", "blue", "ham"]}),
+]
+
+PARTIAL_DATA = [
+ # partial annotation
+ ("She likes green eggs", {"lemmas": ["", "like", "green", ""]}),
+ # misaligned partial annotation
+ (
+ "He hates green eggs",
+ {
+ "words": ["He", "hat", "es", "green", "eggs"],
+ "lemmas": ["", "hat", "e", "green", ""],
+ },
+ ),
+]
+
+
+def test_initialize_examples():
+ nlp = Language()
+ lemmatizer = nlp.add_pipe("trainable_lemmatizer")
+ train_examples = []
+ for t in TRAIN_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+ # you shouldn't really call this more than once, but for testing it should be fine
+ nlp.initialize(get_examples=lambda: train_examples)
+ with pytest.raises(TypeError):
+ nlp.initialize(get_examples=lambda: None)
+ with pytest.raises(TypeError):
+ nlp.initialize(get_examples=lambda: train_examples[0])
+ with pytest.raises(TypeError):
+ nlp.initialize(get_examples=lambda: [])
+ with pytest.raises(TypeError):
+ nlp.initialize(get_examples=train_examples)
+
+
+def test_initialize_from_labels():
+ nlp = Language()
+ lemmatizer = nlp.add_pipe("trainable_lemmatizer")
+ lemmatizer.min_tree_freq = 1
+ train_examples = []
+ for t in TRAIN_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+ nlp.initialize(get_examples=lambda: train_examples)
+
+ nlp2 = Language()
+ lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
+ lemmatizer2.initialize(
+ get_examples=lambda: train_examples,
+ labels=lemmatizer.label_data,
+ )
+ assert lemmatizer2.tree2label == {1: 0, 3: 1, 4: 2, 6: 3}
+
+
+def test_no_data():
+    # Test that the lemmatizer provides a nice error when there's no lemma data / labels
+ TEXTCAT_DATA = [
+ ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
+ ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
+ ]
+ nlp = English()
+ nlp.add_pipe("trainable_lemmatizer")
+ nlp.add_pipe("textcat")
+
+ train_examples = []
+ for t in TEXTCAT_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
+ with pytest.raises(ValueError):
+ nlp.initialize(get_examples=lambda: train_examples)
+
+
+def test_incomplete_data():
+ # Test that the lemmatizer works with incomplete information
+ nlp = English()
+ lemmatizer = nlp.add_pipe("trainable_lemmatizer")
+ lemmatizer.min_tree_freq = 1
+ train_examples = []
+ for t in PARTIAL_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+ optimizer = nlp.initialize(get_examples=lambda: train_examples)
+ for i in range(50):
+ losses = {}
+ nlp.update(train_examples, sgd=optimizer, losses=losses)
+ assert losses["trainable_lemmatizer"] < 0.00001
+
+ # test the trained model
+ test_text = "She likes blue eggs"
+ doc = nlp(test_text)
+ assert doc[1].lemma_ == "like"
+ assert doc[2].lemma_ == "blue"
+
+
+def test_overfitting_IO():
+ nlp = English()
+ lemmatizer = nlp.add_pipe("trainable_lemmatizer")
+ lemmatizer.min_tree_freq = 1
+ train_examples = []
+ for t in TRAIN_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
+ optimizer = nlp.initialize(get_examples=lambda: train_examples)
+
+ for i in range(50):
+ losses = {}
+ nlp.update(train_examples, sgd=optimizer, losses=losses)
+ assert losses["trainable_lemmatizer"] < 0.00001
+
+ test_text = "She likes blue eggs"
+ doc = nlp(test_text)
+ assert doc[0].lemma_ == "she"
+ assert doc[1].lemma_ == "like"
+ assert doc[2].lemma_ == "blue"
+ assert doc[3].lemma_ == "egg"
+
+ # Check model after a {to,from}_disk roundtrip
+ with util.make_tempdir() as tmp_dir:
+ nlp.to_disk(tmp_dir)
+ nlp2 = util.load_model_from_path(tmp_dir)
+ doc2 = nlp2(test_text)
+ assert doc2[0].lemma_ == "she"
+ assert doc2[1].lemma_ == "like"
+ assert doc2[2].lemma_ == "blue"
+ assert doc2[3].lemma_ == "egg"
+
+ # Check model after a {to,from}_bytes roundtrip
+ nlp_bytes = nlp.to_bytes()
+ nlp3 = English()
+ nlp3.add_pipe("trainable_lemmatizer")
+ nlp3.from_bytes(nlp_bytes)
+ doc3 = nlp3(test_text)
+ assert doc3[0].lemma_ == "she"
+ assert doc3[1].lemma_ == "like"
+ assert doc3[2].lemma_ == "blue"
+ assert doc3[3].lemma_ == "egg"
+
+ # Check model after a pickle roundtrip.
+ nlp_bytes = pickle.dumps(nlp)
+ nlp4 = pickle.loads(nlp_bytes)
+ doc4 = nlp4(test_text)
+ assert doc4[0].lemma_ == "she"
+ assert doc4[1].lemma_ == "like"
+ assert doc4[2].lemma_ == "blue"
+ assert doc4[3].lemma_ == "egg"
+
+
+def test_lemmatizer_requires_labels():
+ nlp = English()
+ nlp.add_pipe("trainable_lemmatizer")
+ with pytest.raises(ValueError):
+ nlp.initialize()
+
+
+def test_lemmatizer_label_data():
+ nlp = English()
+ lemmatizer = nlp.add_pipe("trainable_lemmatizer")
+ lemmatizer.min_tree_freq = 1
+ train_examples = []
+ for t in TRAIN_DATA:
+ train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
+
+ nlp.initialize(get_examples=lambda: train_examples)
+
+ nlp2 = English()
+ lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
+ lemmatizer2.initialize(
+ get_examples=lambda: train_examples, labels=lemmatizer.label_data
+ )
+
+ # Verify that the labels and trees are the same.
+ assert lemmatizer.labels == lemmatizer2.labels
+ assert lemmatizer.trees.to_bytes() == lemmatizer2.trees.to_bytes()
+
+
+def test_dutch():
+ strings = StringStore()
+ trees = EditTrees(strings)
+ tree = trees.add("deelt", "delen")
+ assert trees.tree_to_str(tree) == "(m 0 3 () (m 0 2 (s '' 'l') (s 'lt' 'n')))"
+
+ tree = trees.add("gedeeld", "delen")
+ assert (
+ trees.tree_to_str(tree) == "(m 2 3 (s 'ge' '') (m 0 2 (s '' 'l') (s 'ld' 'n')))"
+ )
+
+
+def test_from_to_bytes():
+ strings = StringStore()
+ trees = EditTrees(strings)
+ trees.add("deelt", "delen")
+ trees.add("gedeeld", "delen")
+
+ b = trees.to_bytes()
+
+ trees2 = EditTrees(strings)
+ trees2.from_bytes(b)
+
+ # Verify that the nodes did not change.
+ assert len(trees) == len(trees2)
+ for i in range(len(trees)):
+ assert trees.tree_to_str(i) == trees2.tree_to_str(i)
+
+ # Reinserting the same trees should not add new nodes.
+ trees2.add("deelt", "delen")
+ trees2.add("gedeeld", "delen")
+ assert len(trees) == len(trees2)
+
+
+def test_from_to_disk():
+ strings = StringStore()
+ trees = EditTrees(strings)
+ trees.add("deelt", "delen")
+ trees.add("gedeeld", "delen")
+
+ trees2 = EditTrees(strings)
+ with make_tempdir() as temp_dir:
+ trees_file = temp_dir / "edit_trees.bin"
+ trees.to_disk(trees_file)
+ trees2 = trees2.from_disk(trees_file)
+
+ # Verify that the nodes did not change.
+ assert len(trees) == len(trees2)
+ for i in range(len(trees)):
+ assert trees.tree_to_str(i) == trees2.tree_to_str(i)
+
+ # Reinserting the same trees should not add new nodes.
+ trees2.add("deelt", "delen")
+ trees2.add("gedeeld", "delen")
+ assert len(trees) == len(trees2)
+
+
+@given(st.text(), st.text())
+def test_roundtrip(form, lemma):
+ strings = StringStore()
+ trees = EditTrees(strings)
+ tree = trees.add(form, lemma)
+ assert trees.apply(tree, form) == lemma
+
+
+@given(st.text(alphabet="ab"), st.text(alphabet="ab"))
+def test_roundtrip_small_alphabet(form, lemma):
+ # Test with small alphabets to have more overlap.
+ strings = StringStore()
+ trees = EditTrees(strings)
+ tree = trees.add(form, lemma)
+ assert trees.apply(tree, form) == lemma
+
+
+def test_unapplicable_trees():
+ strings = StringStore()
+ trees = EditTrees(strings)
+ tree3 = trees.add("deelt", "delen")
+
+ # Replacement fails.
+ assert trees.apply(tree3, "deeld") == None
+
+ # Suffix + prefix are too large.
+ assert trees.apply(tree3, "de") == None
+
+
+def test_empty_strings():
+ strings = StringStore()
+ trees = EditTrees(strings)
+ no_change = trees.add("xyz", "xyz")
+ empty = trees.add("", "")
+ assert no_change == empty
diff --git a/spacy/tests/pipeline/test_entity_linker.py b/spacy/tests/pipeline/test_entity_linker.py
index 3740e430e..83d5bf0e2 100644
--- a/spacy/tests/pipeline/test_entity_linker.py
+++ b/spacy/tests/pipeline/test_entity_linker.py
@@ -9,6 +9,9 @@ from spacy.compat import pickle
from spacy.kb import Candidate, KnowledgeBase, get_candidates
from spacy.lang.en import English
from spacy.ml import load_kb
+from spacy.pipeline import EntityLinker
+from spacy.pipeline.legacy import EntityLinker_v1
+from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
from spacy.scorer import Scorer
from spacy.tests.util import make_tempdir
from spacy.tokens import Span
@@ -168,6 +171,45 @@ def test_issue7065_b():
assert doc
+def test_no_entities():
+ # Test that having no entities doesn't crash the model
+ TRAIN_DATA = [
+ (
+ "The sky is blue.",
+ {
+ "sent_starts": [1, 0, 0, 0, 0],
+ },
+ )
+ ]
+ nlp = English()
+ vector_length = 3
+ train_examples = []
+ for text, annotation in TRAIN_DATA:
+ doc = nlp(text)
+ train_examples.append(Example.from_dict(doc, annotation))
+
+ def create_kb(vocab):
+ # create artificial KB
+ mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
+ mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+ mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
+ return mykb
+
+ # Create and train the Entity Linker
+ entity_linker = nlp.add_pipe("entity_linker", last=True)
+ entity_linker.set_kb(create_kb)
+ optimizer = nlp.initialize(get_examples=lambda: train_examples)
+ for i in range(2):
+ losses = {}
+ nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+ # adding additional components that are required for the entity_linker
+ nlp.add_pipe("sentencizer", first=True)
+
+ # this will run the pipeline on the examples and shouldn't crash
+ results = nlp.evaluate(train_examples)
+
+
def test_partial_links():
     # Test that having some entities on the doc without gold links doesn't crash
TRAIN_DATA = [
@@ -650,7 +692,7 @@ TRAIN_DATA = [
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}),
("Russ Cochran his reprints include EC Comics.",
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
- "entities": [(0, 12, "PERSON")],
+ "entities": [(0, 12, "PERSON"), (34, 43, "ART")],
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}),
("Russ Cochran has been publishing comic art.",
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
@@ -693,6 +735,7 @@ def test_overfitting_IO():
# Create the Entity Linker component and add it to the pipeline
entity_linker = nlp.add_pipe("entity_linker", last=True)
+ assert isinstance(entity_linker, EntityLinker)
entity_linker.set_kb(create_kb)
assert "Q2146908" in entity_linker.vocab.strings
assert "Q2146908" in entity_linker.kb.vocab.strings
@@ -922,3 +965,113 @@ def test_scorer_links():
assert scores["nel_micro_p"] == 2 / 3
assert scores["nel_micro_r"] == 2 / 4
+
+
+# fmt: off
+@pytest.mark.parametrize(
+ "name,config",
+ [
+ ("entity_linker", {"@architectures": "spacy.EntityLinker.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
+ ("entity_linker", {"@architectures": "spacy.EntityLinker.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
+ ],
+)
+# fmt: on
+def test_legacy_architectures(name, config):
+ # Ensure that the legacy architectures still work
+ vector_length = 3
+ nlp = English()
+
+ train_examples = []
+ for text, annotation in TRAIN_DATA:
+ doc = nlp.make_doc(text)
+ train_examples.append(Example.from_dict(doc, annotation))
+
+ def create_kb(vocab):
+ mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
+ mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
+ mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7])
+ mykb.add_alias(
+ alias="Russ Cochran",
+ entities=["Q2146908", "Q7381115"],
+ probabilities=[0.5, 0.5],
+ )
+ return mykb
+
+ entity_linker = nlp.add_pipe(name, config={"model": config})
+ if config["@architectures"] == "spacy.EntityLinker.v1":
+ assert isinstance(entity_linker, EntityLinker_v1)
+ else:
+ assert isinstance(entity_linker, EntityLinker)
+ entity_linker.set_kb(create_kb)
+ optimizer = nlp.initialize(get_examples=lambda: train_examples)
+
+ for i in range(2):
+ losses = {}
+ nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+
+@pytest.mark.parametrize(
+ "patterns",
+ [
+ # perfect case
+ [{"label": "CHARACTER", "pattern": "Kirby"}],
+ # typo for false negative
+ [{"label": "PERSON", "pattern": "Korby"}],
+ # random stuff for false positive
+ [{"label": "IS", "pattern": "is"}, {"label": "COLOR", "pattern": "pink"}],
+ ],
+)
+def test_no_gold_ents(patterns):
+ # test that annotating components work
+ TRAIN_DATA = [
+ (
+ "Kirby is pink",
+ {
+ "links": {(0, 5): {"Q613241": 1.0}},
+ "entities": [(0, 5, "CHARACTER")],
+ "sent_starts": [1, 0, 0],
+ },
+ )
+ ]
+ nlp = English()
+ vector_length = 3
+ train_examples = []
+ for text, annotation in TRAIN_DATA:
+ doc = nlp(text)
+ train_examples.append(Example.from_dict(doc, annotation))
+
+ # Create a ruler to mark entities
+ ruler = nlp.add_pipe("entity_ruler")
+ ruler.add_patterns(patterns)
+
+ # Apply ruler to examples. In a real pipeline this would be an annotating component.
+ for eg in train_examples:
+ eg.predicted = ruler(eg.predicted)
+
+ def create_kb(vocab):
+ # create artificial KB
+ mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
+ mykb.add_entity(entity="Q613241", freq=12, entity_vector=[6, -4, 3])
+ mykb.add_alias("Kirby", ["Q613241"], [0.9])
+ # Placeholder
+ mykb.add_entity(entity="pink", freq=12, entity_vector=[7, 2, -5])
+ mykb.add_alias("pink", ["pink"], [0.9])
+ return mykb
+
+ # Create and train the Entity Linker
+ entity_linker = nlp.add_pipe(
+ "entity_linker", config={"use_gold_ents": False}, last=True
+ )
+ entity_linker.set_kb(create_kb)
+    assert entity_linker.use_gold_ents is False
+
+ optimizer = nlp.initialize(get_examples=lambda: train_examples)
+ for i in range(2):
+ losses = {}
+ nlp.update(train_examples, sgd=optimizer, losses=losses)
+
+ # adding additional components that are required for the entity_linker
+ nlp.add_pipe("sentencizer", first=True)
+
+ # this will run the pipeline on the examples and shouldn't crash
+ results = nlp.evaluate(train_examples)
diff --git a/spacy/tests/pipeline/test_morphologizer.py b/spacy/tests/pipeline/test_morphologizer.py
index 11d6f0477..33696bfd8 100644
--- a/spacy/tests/pipeline/test_morphologizer.py
+++ b/spacy/tests/pipeline/test_morphologizer.py
@@ -184,7 +184,7 @@ def test_overfitting_IO():
token.pos_ = ""
token.set_morph(None)
optimizer = nlp.initialize(get_examples=lambda: train_examples)
- print(nlp.get_pipe("morphologizer").labels)
+ assert nlp.get_pipe("morphologizer").labels is not None
for i in range(50):
losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses)
diff --git a/spacy/tests/pipeline/test_spancat.py b/spacy/tests/pipeline/test_spancat.py
index 8060bc621..15256a763 100644
--- a/spacy/tests/pipeline/test_spancat.py
+++ b/spacy/tests/pipeline/test_spancat.py
@@ -397,3 +397,25 @@ def test_zero_suggestions():
assert set(spancat.labels) == {"LOC", "PERSON"}
nlp.update(train_examples, sgd=optimizer)
+
+
+def test_set_candidates():
+ nlp = Language()
+ spancat = nlp.add_pipe("spancat", config={"spans_key": SPAN_KEY})
+ train_examples = make_examples(nlp)
+ nlp.initialize(get_examples=lambda: train_examples)
+ texts = [
+ "Just a sentence.",
+ "I like London and Berlin",
+ "I like Berlin",
+ "I eat ham.",
+ ]
+
+ docs = [nlp(text) for text in texts]
+ spancat.set_candidates(docs)
+
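+    # With the default ngram suggester (sizes 1-3), the 4-token first doc yields 4 + 3 + 2 = 9 candidate spans.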
+ assert len(docs) == len(texts)
+ assert type(docs[0].spans["candidates"]) == SpanGroup
+ assert len(docs[0].spans["candidates"]) == 9
+ assert docs[0].spans["candidates"][0].text == "Just"
+ assert docs[0].spans["candidates"][4].text == "Just a"
diff --git a/spacy/tests/pipeline/test_tok2vec.py b/spacy/tests/pipeline/test_tok2vec.py
index eeea906bb..37104c78a 100644
--- a/spacy/tests/pipeline/test_tok2vec.py
+++ b/spacy/tests/pipeline/test_tok2vec.py
@@ -11,7 +11,7 @@ from spacy.lang.en import English
from thinc.api import Config, get_current_ops
from numpy.testing import assert_array_equal
-from ..util import get_batch, make_tempdir
+from ..util import get_batch, make_tempdir, add_vecs_to_vocab
def test_empty_doc():
@@ -100,7 +100,7 @@ cfg_string = """
factory = "tagger"
[components.tagger.model]
- @architectures = "spacy.Tagger.v1"
+ @architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
@@ -140,9 +140,25 @@ TRAIN_DATA = [
]
-def test_tok2vec_listener():
+@pytest.mark.parametrize("with_vectors", (False, True))
+def test_tok2vec_listener(with_vectors):
orig_config = Config().from_str(cfg_string)
+ orig_config["components"]["tok2vec"]["model"]["embed"][
+ "include_static_vectors"
+ ] = with_vectors
nlp = util.load_model_from_config(orig_config, auto_fill=True, validate=True)
+
+ if with_vectors:
+ ops = get_current_ops()
+ vectors = [
+ ("apple", ops.asarray([1, 2, 3])),
+ ("orange", ops.asarray([-1, -2, -3])),
+ ("and", ops.asarray([-1, -1, -1])),
+ ("juice", ops.asarray([5, 5, 10])),
+ ("pie", ops.asarray([7, 6.3, 8.9])),
+ ]
+ add_vecs_to_vocab(nlp.vocab, vectors)
+
assert nlp.pipe_names == ["tok2vec", "tagger"]
tagger = nlp.get_pipe("tagger")
tok2vec = nlp.get_pipe("tok2vec")
@@ -169,6 +185,9 @@ def test_tok2vec_listener():
ops = get_current_ops()
assert_array_equal(ops.to_numpy(doc.tensor), ops.to_numpy(doc_tensor))
+ # test with empty doc
+ doc = nlp("")
+
# TODO: should this warn or error?
nlp.select_pipes(disable="tok2vec")
assert nlp.pipe_names == ["tagger"]
@@ -244,7 +263,7 @@ cfg_string_multi = """
factory = "tagger"
[components.tagger.model]
- @architectures = "spacy.Tagger.v1"
+ @architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
@@ -354,7 +373,7 @@ cfg_string_multi_textcat = """
factory = "tagger"
[components.tagger.model]
- @architectures = "spacy.Tagger.v1"
+ @architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
diff --git a/spacy/tests/serialize/test_serialize_config.py b/spacy/tests/serialize/test_serialize_config.py
index 1d50fd1d1..85e6f8b2c 100644
--- a/spacy/tests/serialize/test_serialize_config.py
+++ b/spacy/tests/serialize/test_serialize_config.py
@@ -59,7 +59,7 @@ subword_features = true
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
@@ -110,7 +110,7 @@ subword_features = true
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
diff --git a/spacy/tests/serialize/test_serialize_language.py b/spacy/tests/serialize/test_serialize_language.py
index 6e7fa0e4e..c03287548 100644
--- a/spacy/tests/serialize/test_serialize_language.py
+++ b/spacy/tests/serialize/test_serialize_language.py
@@ -70,7 +70,7 @@ factory = "ner"
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
diff --git a/spacy/tests/serialize/test_serialize_tokenizer.py b/spacy/tests/serialize/test_serialize_tokenizer.py
index e271f7707..9b74d7721 100644
--- a/spacy/tests/serialize/test_serialize_tokenizer.py
+++ b/spacy/tests/serialize/test_serialize_tokenizer.py
@@ -70,6 +70,7 @@ def test_issue4190():
suffix_search=suffix_re.search,
infix_finditer=infix_re.finditer,
token_match=nlp.tokenizer.token_match,
+ faster_heuristics=False,
)
nlp.tokenizer = new_tokenizer
@@ -90,6 +91,7 @@ def test_issue4190():
doc_2 = nlp_2(test_string)
result_2 = [token.text for token in doc_2]
assert result_1b == result_2
+ assert nlp_2.tokenizer.faster_heuristics is False
def test_serialize_custom_tokenizer(en_vocab, en_tokenizer):
diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py
index 253469909..0fa6f5670 100644
--- a/spacy/tests/test_cli.py
+++ b/spacy/tests/test_cli.py
@@ -12,16 +12,18 @@ from spacy.cli._util import is_subpath_of, load_project_config
from spacy.cli._util import parse_config_overrides, string_to_list
from spacy.cli._util import substitute_project_variables
from spacy.cli._util import validate_project_commands
-from spacy.cli.debug_data import _get_labels_from_model
+from spacy.cli.debug_data import _compile_gold, _get_labels_from_model
from spacy.cli.debug_data import _get_labels_from_spancat
from spacy.cli.download import get_compatibility, get_version
from spacy.cli.init_config import RECOMMENDATIONS, init_config, fill_config
from spacy.cli.package import get_third_party_dependencies
+from spacy.cli.package import _is_permitted_package_name
from spacy.cli.validate import get_model_pkgs
from spacy.lang.en import English
from spacy.lang.nl import Dutch
from spacy.language import Language
from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate
+from spacy.tokens import Doc
from spacy.training import Example, docs_to_json, offsets_to_biluo_tags
from spacy.training.converters import conll_ner_to_docs, conllu_to_docs
from spacy.training.converters import iob_to_docs
@@ -32,7 +34,7 @@ from .util import make_tempdir
@pytest.mark.issue(4665)
-def test_issue4665():
+def test_cli_converters_conllu_empty_heads_ner():
"""
conllu_to_docs should not raise an exception if the HEAD column contains an
underscore
@@ -57,7 +59,11 @@ def test_issue4665():
17 . _ PUNCT . _ _ punct _ _
18 ] _ PUNCT -RRB- _ _ punct _ _
"""
- conllu_to_docs(input_data)
+ docs = list(conllu_to_docs(input_data))
+ # heads are all 0
+ assert not all([t.head.i for t in docs[0]])
+ # NER is unset
+ assert not docs[0].has_annotation("ENT_IOB")
@pytest.mark.issue(4924)
@@ -211,7 +217,6 @@ def test_cli_converters_conllu_to_docs_subtokens():
sent = converted[0]["paragraphs"][0]["sentences"][0]
assert len(sent["tokens"]) == 4
tokens = sent["tokens"]
- print(tokens)
assert [t["orth"] for t in tokens] == ["Dommer", "FE", "avstรฅr", "."]
assert [t["tag"] for t in tokens] == [
"NOUN__Definite=Ind|Gender=Masc|Number=Sing",
@@ -692,3 +697,39 @@ def test_get_labels_from_model(factory_name, pipe_name):
assert _get_labels_from_spancat(nlp)[pipe.key] == set(labels)
else:
assert _get_labels_from_model(nlp, factory_name) == set(labels)
+
+
+def test_permitted_package_names():
+ # https://www.python.org/dev/peps/pep-0426/#name
+ assert _is_permitted_package_name("Meine_Bรคume") == False
+ assert _is_permitted_package_name("_package") == False
+ assert _is_permitted_package_name("package_") == False
+ assert _is_permitted_package_name(".package") == False
+ assert _is_permitted_package_name("package.") == False
+ assert _is_permitted_package_name("-package") == False
+ assert _is_permitted_package_name("package-") == False
+
+
+def test_debug_data_compile_gold():
+ nlp = English()
+ pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
+ ref = Doc(
+ nlp.vocab,
+ words=["Token", ".", "New York City"],
+ sent_starts=[True, False, True],
+ ents=["O", "O", "B-ENT"],
+ )
+ eg = Example(pred, ref)
+ data = _compile_gold([eg], ["ner"], nlp, True)
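+    # "New York City" starts the second sentence here, so the entity does not cross a sentence boundary.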
+ assert data["boundary_cross_ents"] == 0
+
+ pred = Doc(nlp.vocab, words=["Token", ".", "New", "York", "City"])
+ ref = Doc(
+ nlp.vocab,
+ words=["Token", ".", "New York City"],
+ sent_starts=[True, False, True],
+ ents=["O", "B-ENT", "I-ENT"],
+ )
+ eg = Example(pred, ref)
+ data = _compile_gold([eg], ["ner"], nlp, True)
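+    # Now the entity spans "." from the first sentence and "New York City" from the second, crossing a boundary.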
+ assert data["boundary_cross_ents"] == 1
diff --git a/spacy/tests/test_displacy.py b/spacy/tests/test_displacy.py
index 392c95e42..ccc145b44 100644
--- a/spacy/tests/test_displacy.py
+++ b/spacy/tests/test_displacy.py
@@ -83,6 +83,27 @@ def test_issue3882(en_vocab):
displacy.parse_deps(doc)
+@pytest.mark.issue(5447)
+def test_issue5447():
+ """Test that overlapping arcs get separate levels, unless they're identical."""
+ renderer = DependencyRenderer()
+ words = [
+ {"text": "This", "tag": "DT"},
+ {"text": "is", "tag": "VBZ"},
+ {"text": "a", "tag": "DT"},
+ {"text": "sentence.", "tag": "NN"},
+ ]
+ arcs = [
+ {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
+ {"start": 2, "end": 3, "label": "det", "dir": "left"},
+ {"start": 2, "end": 3, "label": "overlap", "dir": "left"},
+ {"end": 3, "label": "overlap", "start": 2, "dir": "left"},
+ {"start": 1, "end": 3, "label": "attr", "dir": "left"},
+ ]
+ renderer.render([{"words": words, "arcs": arcs}])
+ assert renderer.highest_level == 3
+
+
@pytest.mark.issue(5838)
def test_issue5838():
# Displacy's EntityRenderer break line
@@ -96,6 +117,92 @@ def test_issue5838():
assert found == 4
+def test_displacy_parse_spans(en_vocab):
+ """Test that spans on a Doc are converted into displaCy's format."""
+ doc = Doc(en_vocab, words=["Welcome", "to", "the", "Bank", "of", "China"])
+ doc.spans["sc"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
+ spans = displacy.parse_spans(doc)
+ assert isinstance(spans, dict)
+ assert spans["text"] == "Welcome to the Bank of China "
+ assert spans["spans"] == [
+ {
+ "start": 15,
+ "end": 28,
+ "start_token": 3,
+ "end_token": 6,
+ "label": "ORG",
+ "kb_id": "",
+ "kb_url": "#",
+ },
+ {
+ "start": 23,
+ "end": 28,
+ "start_token": 5,
+ "end_token": 6,
+ "label": "GPE",
+ "kb_id": "",
+ "kb_url": "#",
+ },
+ ]
+
+
+def test_displacy_parse_spans_with_kb_id_options(en_vocab):
+ """Test that spans with kb_id on a Doc are converted into displaCy's format"""
+ doc = Doc(en_vocab, words=["Welcome", "to", "the", "Bank", "of", "China"])
+ doc.spans["sc"] = [
+ Span(doc, 3, 6, "ORG", kb_id="Q790068"),
+ Span(doc, 5, 6, "GPE", kb_id="Q148"),
+ ]
+
+ spans = displacy.parse_spans(
+ doc, {"kb_url_template": "https://wikidata.org/wiki/{}"}
+ )
+ assert isinstance(spans, dict)
+ assert spans["text"] == "Welcome to the Bank of China "
+ assert spans["spans"] == [
+ {
+ "start": 15,
+ "end": 28,
+ "start_token": 3,
+ "end_token": 6,
+ "label": "ORG",
+ "kb_id": "Q790068",
+ "kb_url": "https://wikidata.org/wiki/Q790068",
+ },
+ {
+ "start": 23,
+ "end": 28,
+ "start_token": 5,
+ "end_token": 6,
+ "label": "GPE",
+ "kb_id": "Q148",
+ "kb_url": "https://wikidata.org/wiki/Q148",
+ },
+ ]
+
+
+def test_displacy_parse_spans_different_spans_key(en_vocab):
+ """Test that spans in a different spans key will be parsed"""
+ doc = Doc(en_vocab, words=["Welcome", "to", "the", "Bank", "of", "China"])
+ doc.spans["sc"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
+ doc.spans["custom"] = [Span(doc, 3, 6, "BANK")]
+ spans = displacy.parse_spans(doc, options={"spans_key": "custom"})
+
+ assert isinstance(spans, dict)
+ assert spans["text"] == "Welcome to the Bank of China "
+ assert spans["spans"] == [
+ {
+ "start": 15,
+ "end": 28,
+ "start_token": 3,
+ "end_token": 6,
+ "label": "BANK",
+ "kb_id": "",
+ "kb_url": "#",
+ }
+ ]
+
+
def test_displacy_parse_ents(en_vocab):
"""Test that named entities on a Doc are converted into displaCy's format."""
doc = Doc(en_vocab, words=["But", "Google", "is", "starting", "from", "behind"])
@@ -231,3 +338,18 @@ def test_displacy_options_case():
assert "green" in result[1] and "bar" in result[1]
assert "red" in result[2] and "FOO" in result[2]
assert "green" in result[3] and "BAR" in result[3]
+
+
+@pytest.mark.issue(10672)
+def test_displacy_manual_sorted_entities():
+ doc = {
+ "text": "But Google is starting from behind.",
+ "ents": [
+ {"start": 14, "end": 22, "label": "SECOND"},
+ {"start": 4, "end": 10, "label": "FIRST"},
+ ],
+ "title": None,
+ }
+
+ html = displacy.render(doc, style="ent", manual=True)
+ assert html.find("FIRST") < html.find("SECOND")
diff --git a/spacy/tests/tokenizer/test_tokenizer.py b/spacy/tests/tokenizer/test_tokenizer.py
index a7270cb1e..6af58b344 100644
--- a/spacy/tests/tokenizer/test_tokenizer.py
+++ b/spacy/tests/tokenizer/test_tokenizer.py
@@ -521,3 +521,33 @@ def test_tokenizer_infix_prefix(en_vocab):
     assert tokens == ["±10", "%"]
     explain_tokens = [t[1] for t in tokenizer.explain("±10%")]
assert tokens == explain_tokens
+
+
+@pytest.mark.issue(10086)
+def test_issue10086(en_tokenizer):
+ """Test special case works when part of infix substring."""
+ text = "No--don't see"
+
+ # without heuristics: do n't
+ en_tokenizer.faster_heuristics = False
+ doc = en_tokenizer(text)
+ assert "n't" in [w.text for w in doc]
+ assert "do" in [w.text for w in doc]
+
+ # with (default) heuristics: don't
+ en_tokenizer.faster_heuristics = True
+ doc = en_tokenizer(text)
+ assert "don't" in [w.text for w in doc]
+
+
+def test_tokenizer_initial_special_case_explain(en_vocab):
+ tokenizer = Tokenizer(
+ en_vocab,
+ token_match=re.compile("^id$").match,
+ rules={
+ "id": [{"ORTH": "i"}, {"ORTH": "d"}],
+ },
+ )
+ tokens = [t.text for t in tokenizer("id")]
+ explain_tokens = [t[1] for t in tokenizer.explain("id")]
+ assert tokens == explain_tokens
diff --git a/spacy/tests/training/test_augmenters.py b/spacy/tests/training/test_augmenters.py
index 43a78e4b0..e3639c5da 100644
--- a/spacy/tests/training/test_augmenters.py
+++ b/spacy/tests/training/test_augmenters.py
@@ -1,9 +1,11 @@
import pytest
-from spacy.training import Corpus
+from spacy.pipeline._parser_internals.nonproj import contains_cycle
+from spacy.training import Corpus, Example
from spacy.training.augment import create_orth_variants_augmenter
from spacy.training.augment import create_lower_casing_augmenter
+from spacy.training.augment import make_whitespace_variant
from spacy.lang.en import English
-from spacy.tokens import DocBin, Doc
+from spacy.tokens import DocBin, Doc, Span
from contextlib import contextmanager
import random
@@ -153,3 +155,84 @@ def test_custom_data_augmentation(nlp, doc):
ents = [(e.start, e.end, e.label) for e in doc.ents]
assert [(e.start, e.end, e.label) for e in corpus[0].reference.ents] == ents
assert [(e.start, e.end, e.label) for e in corpus[1].reference.ents] == ents
+
+
+def test_make_whitespace_variant(nlp):
+ # fmt: off
+ text = "They flew to New York City.\nThen they drove to Washington, D.C."
+ words = ["They", "flew", "to", "New", "York", "City", ".", "\n", "Then", "they", "drove", "to", "Washington", ",", "D.C."]
+ spaces = [True, True, True, True, True, False, False, False, True, True, True, True, False, True, False]
+ tags = ["PRP", "VBD", "IN", "NNP", "NNP", "NNP", ".", "_SP", "RB", "PRP", "VBD", "IN", "NNP", ",", "NNP"]
+ lemmas = ["they", "fly", "to", "New", "York", "City", ".", "\n", "then", "they", "drive", "to", "Washington", ",", "D.C."]
+ heads = [1, 1, 1, 4, 5, 2, 1, 10, 10, 10, 10, 10, 11, 12, 12]
+ deps = ["nsubj", "ROOT", "prep", "compound", "compound", "pobj", "punct", "dep", "advmod", "nsubj", "ROOT", "prep", "pobj", "punct", "appos"]
+ ents = ["O", "O", "O", "B-GPE", "I-GPE", "I-GPE", "O", "O", "O", "O", "O", "O", "B-GPE", "O", "B-GPE"]
+ # fmt: on
+ doc = Doc(
+ nlp.vocab,
+ words=words,
+ spaces=spaces,
+ tags=tags,
+ lemmas=lemmas,
+ heads=heads,
+ deps=deps,
+ ents=ents,
+ )
+ assert doc.text == text
+ example = Example(nlp.make_doc(text), doc)
+ # whitespace is only added internally in entity spans
+ mod_ex = make_whitespace_variant(nlp, example, " ", 3)
+ assert mod_ex.reference.ents[0].text == "New York City"
+ mod_ex = make_whitespace_variant(nlp, example, " ", 4)
+ assert mod_ex.reference.ents[0].text == "New York City"
+ mod_ex = make_whitespace_variant(nlp, example, " ", 5)
+ assert mod_ex.reference.ents[0].text == "New York City"
+ mod_ex = make_whitespace_variant(nlp, example, " ", 6)
+ assert mod_ex.reference.ents[0].text == "New York City"
+ # add a space at every possible position
+ for i in range(len(doc) + 1):
+ mod_ex = make_whitespace_variant(nlp, example, " ", i)
+ assert mod_ex.reference[i].is_space
+ # adds annotation when the doc contains at least partial annotation
+ assert [t.tag_ for t in mod_ex.reference] == tags[:i] + ["_SP"] + tags[i:]
+ assert [t.lemma_ for t in mod_ex.reference] == lemmas[:i] + [" "] + lemmas[i:]
+ assert [t.dep_ for t in mod_ex.reference] == deps[:i] + ["dep"] + deps[i:]
+ # does not add partial annotation if doc does not contain this feature
+ assert not mod_ex.reference.has_annotation("POS")
+ assert not mod_ex.reference.has_annotation("MORPH")
+ # produces well-formed trees
+ assert not contains_cycle([t.head.i for t in mod_ex.reference])
+ assert len(list(doc.sents)) == 2
+ if i == 0:
+ assert mod_ex.reference[i].head.i == 1
+ else:
+ assert mod_ex.reference[i].head.i == i - 1
+ # adding another space also produces well-formed trees
+ for j in (3, 8, 10):
+ mod_ex2 = make_whitespace_variant(nlp, mod_ex, "\t\t\n", j)
+ assert not contains_cycle([t.head.i for t in mod_ex2.reference])
+ assert len(list(doc.sents)) == 2
+ assert mod_ex2.reference[j].head.i == j - 1
+ # entities are well-formed
+ assert len(doc.ents) == len(mod_ex.reference.ents)
+ for ent in mod_ex.reference.ents:
+ assert not ent[0].is_space
+ assert not ent[-1].is_space
+
+ # no modifications if:
+ # partial dependencies
+ example.reference[0].dep_ = ""
+ mod_ex = make_whitespace_variant(nlp, example, " ", 5)
+ assert mod_ex.text == example.reference.text
+ example.reference[0].dep_ = "nsubj" # reset
+
+ # spans
+ example.reference.spans["spans"] = [example.reference[0:5]]
+ mod_ex = make_whitespace_variant(nlp, example, " ", 5)
+ assert mod_ex.text == example.reference.text
+ del example.reference.spans["spans"] # reset
+
+ # links
+ example.reference.ents = [Span(doc, 0, 2, label="ENT", kb_id="Q123")]
+ mod_ex = make_whitespace_variant(nlp, example, " ", 5)
+ assert mod_ex.text == example.reference.text
diff --git a/spacy/tests/training/test_pretraining.py b/spacy/tests/training/test_pretraining.py
index 8ee54b544..9359c8485 100644
--- a/spacy/tests/training/test_pretraining.py
+++ b/spacy/tests/training/test_pretraining.py
@@ -38,7 +38,7 @@ subword_features = true
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
@@ -62,7 +62,7 @@ pipeline = ["tagger"]
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[components.tagger.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
@@ -106,7 +106,7 @@ subword_features = true
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
[components.tagger.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
diff --git a/spacy/tests/training/test_rehearse.py b/spacy/tests/training/test_rehearse.py
new file mode 100644
index 000000000..84c507702
--- /dev/null
+++ b/spacy/tests/training/test_rehearse.py
@@ -0,0 +1,211 @@
+import pytest
+import spacy
+
+from typing import List
+from spacy.training import Example
+
+
+TRAIN_DATA = [
+ (
+ "Who is Kofi Annan?",
+ {
+ "entities": [(7, 18, "PERSON")],
+ "tags": ["PRON", "AUX", "PROPN", "PRON", "PUNCT"],
+ "heads": [1, 1, 3, 1, 1],
+ "deps": ["attr", "ROOT", "compound", "nsubj", "punct"],
+ "morphs": [
+ "",
+ "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
+ "Number=Sing",
+ "Number=Sing",
+ "PunctType=Peri",
+ ],
+ "cats": {"question": 1.0},
+ },
+ ),
+ (
+ "Who is Steve Jobs?",
+ {
+ "entities": [(7, 17, "PERSON")],
+ "tags": ["PRON", "AUX", "PROPN", "PRON", "PUNCT"],
+ "heads": [1, 1, 3, 1, 1],
+ "deps": ["attr", "ROOT", "compound", "nsubj", "punct"],
+ "morphs": [
+ "",
+ "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
+ "Number=Sing",
+ "Number=Sing",
+ "PunctType=Peri",
+ ],
+ "cats": {"question": 1.0},
+ },
+ ),
+ (
+ "Bob is a nice person.",
+ {
+ "entities": [(0, 3, "PERSON")],
+ "tags": ["PROPN", "AUX", "DET", "ADJ", "NOUN", "PUNCT"],
+ "heads": [1, 1, 4, 4, 1, 1],
+ "deps": ["nsubj", "ROOT", "det", "amod", "attr", "punct"],
+ "morphs": [
+ "Number=Sing",
+ "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin",
+ "Definite=Ind|PronType=Art",
+ "Degree=Pos",
+ "Number=Sing",
+ "PunctType=Peri",
+ ],
+ "cats": {"statement": 1.0},
+ },
+ ),
+ (
+ "Hi Anil, how are you?",
+ {
+ "entities": [(3, 7, "PERSON")],
+ "tags": ["INTJ", "PROPN", "PUNCT", "ADV", "AUX", "PRON", "PUNCT"],
+ "deps": ["intj", "npadvmod", "punct", "advmod", "ROOT", "nsubj", "punct"],
+ "heads": [4, 0, 4, 4, 4, 4, 4],
+ "morphs": [
+ "",
+ "Number=Sing",
+ "PunctType=Comm",
+ "",
+ "Mood=Ind|Tense=Pres|VerbForm=Fin",
+ "Case=Nom|Person=2|PronType=Prs",
+ "PunctType=Peri",
+ ],
+ "cats": {"greeting": 1.0, "question": 1.0},
+ },
+ ),
+ (
+ "I like London and Berlin.",
+ {
+ "entities": [(7, 13, "LOC"), (18, 24, "LOC")],
+ "tags": ["PROPN", "VERB", "PROPN", "CCONJ", "PROPN", "PUNCT"],
+ "deps": ["nsubj", "ROOT", "dobj", "cc", "conj", "punct"],
+ "heads": [1, 1, 1, 2, 2, 1],
+ "morphs": [
+ "Case=Nom|Number=Sing|Person=1|PronType=Prs",
+ "Tense=Pres|VerbForm=Fin",
+ "Number=Sing",
+ "ConjType=Cmp",
+ "Number=Sing",
+ "PunctType=Peri",
+ ],
+ "cats": {"statement": 1.0},
+ },
+ ),
+]
+
+REHEARSE_DATA = [
+ (
+ "Hi Anil",
+ {
+ "entities": [(3, 7, "PERSON")],
+ "tags": ["INTJ", "PROPN"],
+ "deps": ["ROOT", "npadvmod"],
+ "heads": [0, 0],
+ "morphs": ["", "Number=Sing"],
+ "cats": {"greeting": 1.0},
+ },
+ ),
+ (
+ "Hi Ravish, how you doing?",
+ {
+ "entities": [(3, 9, "PERSON")],
+ "tags": ["INTJ", "PROPN", "PUNCT", "ADV", "AUX", "PRON", "PUNCT"],
+ "deps": ["intj", "ROOT", "punct", "advmod", "nsubj", "advcl", "punct"],
+ "heads": [1, 1, 1, 5, 5, 1, 1],
+ "morphs": [
+ "",
+ "VerbForm=Inf",
+ "PunctType=Comm",
+ "",
+ "Case=Nom|Person=2|PronType=Prs",
+ "Aspect=Prog|Tense=Pres|VerbForm=Part",
+ "PunctType=Peri",
+ ],
+ "cats": {"greeting": 1.0, "question": 1.0},
+ },
+ ),
+ # UTENSIL new label
+ (
+ "Natasha bought new forks.",
+ {
+ "entities": [(0, 7, "PERSON"), (19, 24, "UTENSIL")],
+ "tags": ["PROPN", "VERB", "ADJ", "NOUN", "PUNCT"],
+ "deps": ["nsubj", "ROOT", "amod", "dobj", "punct"],
+ "heads": [1, 1, 3, 1, 1],
+ "morphs": [
+ "Number=Sing",
+ "Tense=Past|VerbForm=Fin",
+ "Degree=Pos",
+ "Number=Plur",
+ "PunctType=Peri",
+ ],
+ "cats": {"statement": 1.0},
+ },
+ ),
+]
+
+
+def _add_ner_label(ner, data):
+ for _, annotations in data:
+ for ent in annotations["entities"]:
+ ner.add_label(ent[2])
+
+
+def _add_tagger_label(tagger, data):
+ for _, annotations in data:
+ for tag in annotations["tags"]:
+ tagger.add_label(tag)
+
+
+def _add_parser_label(parser, data):
+ for _, annotations in data:
+ for dep in annotations["deps"]:
+ parser.add_label(dep)
+
+
+def _add_textcat_label(textcat, data):
+ for _, annotations in data:
+ for cat in annotations["cats"]:
+ textcat.add_label(cat)
+
+
+def _optimize(nlp, component: str, data: List, rehearse: bool):
+ """Run either train or rehearse."""
+ pipe = nlp.get_pipe(component)
+ if component == "ner":
+ _add_ner_label(pipe, data)
+ elif component == "tagger":
+ _add_tagger_label(pipe, data)
+ elif component == "parser":
+        _add_parser_label(pipe, data)
+ elif component == "textcat_multilabel":
+ _add_textcat_label(pipe, data)
+ else:
+ raise NotImplementedError
+
+ if rehearse:
+ optimizer = nlp.resume_training()
+ else:
+ optimizer = nlp.initialize()
+
+ for _ in range(5):
+ for text, annotation in data:
+ doc = nlp.make_doc(text)
+ example = Example.from_dict(doc, annotation)
+ if rehearse:
+ nlp.rehearse([example], sgd=optimizer)
+ else:
+ nlp.update([example], sgd=optimizer)
+ return nlp
+
+
+@pytest.mark.parametrize("component", ["ner", "tagger", "parser", "textcat_multilabel"])
+def test_rehearse(component):
+ nlp = spacy.blank("en")
+ nlp.add_pipe(component)
+ nlp = _optimize(nlp, component, TRAIN_DATA, False)
+ _optimize(nlp, component, REHEARSE_DATA, True)
diff --git a/spacy/tests/training/test_training.py b/spacy/tests/training/test_training.py
index 0d73300d8..8e08a25fb 100644
--- a/spacy/tests/training/test_training.py
+++ b/spacy/tests/training/test_training.py
@@ -8,6 +8,7 @@ from spacy.tokens import Doc, DocBin
from spacy.training import Alignment, Corpus, Example, biluo_tags_to_offsets
from spacy.training import biluo_tags_to_spans, docs_to_json, iob_to_biluo
from spacy.training import offsets_to_biluo_tags
+from spacy.training.alignment_array import AlignmentArray
from spacy.training.align import get_alignments
from spacy.training.converters import json_to_docs
from spacy.util import get_words_and_spaces, load_model_from_path, minibatch
@@ -241,7 +242,7 @@ maxout_pieces = 3
factory = "tagger"
[components.tagger.model]
-@architectures = "spacy.Tagger.v1"
+@architectures = "spacy.Tagger.v2"
nO = null
[components.tagger.model.tok2vec]
@@ -908,9 +909,41 @@ def test_alignment():
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [1, 1, 1, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 6]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 6]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 1, 1]
- assert list(align.y2x.dataXd) == [0, 1, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [0, 1, 2, 3, 4, 5, 6, 7]
+
+
+def test_alignment_array():
+ a = AlignmentArray([[0, 1, 2], [3], [], [4, 5, 6, 7], [8, 9]])
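+    # The ragged alignment is stored as a flat data array plus per-row lengths.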
+ assert list(a.data) == [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
+ assert list(a.lengths) == [3, 1, 0, 4, 2]
+ assert list(a[3]) == [4, 5, 6, 7]
+ assert list(a[2]) == []
+ assert list(a[-2]) == [4, 5, 6, 7]
+ assert list(a[1:4]) == [3, 4, 5, 6, 7]
+ assert list(a[1:]) == [3, 4, 5, 6, 7, 8, 9]
+ assert list(a[:3]) == [0, 1, 2, 3]
+ assert list(a[:]) == list(a.data)
+ assert list(a[0:0]) == []
+ assert list(a[3:3]) == []
+ assert list(a[-1:-1]) == []
+ with pytest.raises(ValueError, match=r"only supports slicing with a step of 1"):
+ a[:4:-1]
+ with pytest.raises(
+ ValueError, match=r"only supports indexing using an int or a slice"
+ ):
+ a[[0, 1, 3]]
+
+ a = AlignmentArray([[], [1, 2, 3], [4, 5]])
+ assert list(a[0]) == []
+ assert list(a[0:1]) == []
+ assert list(a[2]) == [4, 5]
+ assert list(a[0:2]) == [1, 2, 3]
+
+ a = AlignmentArray([[1, 2, 3], [4, 5], []])
+ assert list(a[-1]) == []
+ assert list(a[-2:]) == [4, 5]
def test_alignment_case_insensitive():
@@ -918,9 +951,9 @@ def test_alignment_case_insensitive():
spacy_tokens = ["i", "listened", "to", "Obama", "'s", "PODCASTS", "."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [1, 1, 1, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 6]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 6]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 1, 1]
- assert list(align.y2x.dataXd) == [0, 1, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [0, 1, 2, 3, 4, 5, 6, 7]
def test_alignment_complex():
@@ -928,9 +961,9 @@ def test_alignment_complex():
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
+ assert list(align.y2x.data) == [0, 0, 0, 1, 2, 3, 4, 5]
def test_alignment_complex_example(en_vocab):
@@ -947,9 +980,9 @@ def test_alignment_complex_example(en_vocab):
example = Example(predicted, reference)
align = example.alignment
assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
+ assert list(align.y2x.data) == [0, 0, 0, 1, 2, 3, 4, 5]
def test_alignment_different_texts():
@@ -965,70 +998,70 @@ def test_alignment_spaces(en_vocab):
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [0, 3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [1, 1, 1, 2, 3, 4, 5, 6]
+ assert list(align.y2x.data) == [1, 1, 1, 2, 3, 4, 5, 6]
# multiple leading whitespace tokens
other_tokens = [" ", " ", "i listened to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [0, 0, 3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [2, 2, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [2, 2, 2, 3, 4, 5, 6, 7]
# both with leading whitespace, not identical
other_tokens = [" ", " ", "i listened to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = [" ", "i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [1, 0, 3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 5, 5, 6, 6]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 5, 5, 6, 6]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [0, 2, 2, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [0, 2, 2, 2, 3, 4, 5, 6, 7]
# same leading whitespace, different tokenization
other_tokens = [" ", " ", "i listened to", "obama", "'", "s", "podcasts", "."]
spacy_tokens = [" ", "i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [1, 1, 3, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 0, 1, 2, 3, 4, 5, 5, 6, 6]
+ assert list(align.x2y.data) == [0, 0, 1, 2, 3, 4, 5, 5, 6, 6]
assert list(align.y2x.lengths) == [2, 1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [0, 1, 2, 2, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [0, 1, 2, 2, 2, 3, 4, 5, 6, 7]
# only one with trailing whitespace
other_tokens = ["i listened to", "obama", "'", "s", "podcasts", ".", " "]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts."]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1, 0]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2]
- assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5]
+ assert list(align.y2x.data) == [0, 0, 0, 1, 2, 3, 4, 5]
# different trailing whitespace
other_tokens = ["i listened to", "obama", "'", "s", "podcasts", ".", " ", " "]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts.", " "]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1, 1, 0]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5, 6]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5, 6]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2, 1]
- assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5, 6]
+ assert list(align.y2x.data) == [0, 0, 0, 1, 2, 3, 4, 5, 6]
# same trailing whitespace, different tokenization
other_tokens = ["i listened to", "obama", "'", "s", "podcasts", ".", " ", " "]
spacy_tokens = ["i", "listened", "to", "obama", "'s", "podcasts.", " "]
align = Alignment.from_strings(other_tokens, spacy_tokens)
assert list(align.x2y.lengths) == [3, 1, 1, 1, 1, 1, 1, 1]
- assert list(align.x2y.dataXd) == [0, 1, 2, 3, 4, 4, 5, 5, 6, 6]
+ assert list(align.x2y.data) == [0, 1, 2, 3, 4, 4, 5, 5, 6, 6]
assert list(align.y2x.lengths) == [1, 1, 1, 1, 2, 2, 2]
- assert list(align.y2x.dataXd) == [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
+ assert list(align.y2x.data) == [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
# differing whitespace is allowed
other_tokens = ["a", " \n ", "b", "c"]
spacy_tokens = ["a", "b", " ", "c"]
align = Alignment.from_strings(other_tokens, spacy_tokens)
- assert list(align.x2y.dataXd) == [0, 1, 3]
- assert list(align.y2x.dataXd) == [0, 2, 3]
+ assert list(align.x2y.data) == [0, 1, 3]
+ assert list(align.y2x.data) == [0, 2, 3]
# other differences in whitespace are allowed
other_tokens = [" ", "a"]
diff --git a/spacy/tests/universe/test_universe_json.py b/spacy/tests/universe/test_universe_json.py
deleted file mode 100644
index 295889186..000000000
--- a/spacy/tests/universe/test_universe_json.py
+++ /dev/null
@@ -1,17 +0,0 @@
-import json
-import re
-from pathlib import Path
-
-
-def test_universe_json():
-
- root_dir = Path(__file__).parent
- universe_file = root_dir / "universe.json"
-
- with universe_file.open() as f:
- universe_data = json.load(f)
- for entry in universe_data["resources"]:
- if "github" in entry:
- assert not re.match(
- r"^(http:)|^(https:)", entry["github"]
- ), "Github field should be user/repo, not a url"
diff --git a/spacy/tests/vocab_vectors/test_vectors.py b/spacy/tests/vocab_vectors/test_vectors.py
index 0650a7487..e3ad206f4 100644
--- a/spacy/tests/vocab_vectors/test_vectors.py
+++ b/spacy/tests/vocab_vectors/test_vectors.py
@@ -455,6 +455,39 @@ def test_vectors_get_batch():
assert_equal(OPS.to_numpy(vecs), OPS.to_numpy(v.get_batch(words)))
+def test_vectors_deduplicate():
+ data = OPS.asarray([[1, 1], [2, 2], [3, 4], [1, 1], [3, 4]], dtype="f")
+ v = Vectors(data=data, keys=["a1", "b1", "c1", "a2", "c2"])
+ vocab = Vocab()
+ vocab.vectors = v
+ # duplicate vectors do not use the same keys
+ assert (
+ vocab.vectors.key2row[v.strings["a1"]] != vocab.vectors.key2row[v.strings["a2"]]
+ )
+ assert (
+ vocab.vectors.key2row[v.strings["c1"]] != vocab.vectors.key2row[v.strings["c2"]]
+ )
+ vocab.deduplicate_vectors()
+ # there are three unique vectors
+ assert vocab.vectors.shape[0] == 3
+ # the uniqued data is the same as the deduplicated data
+ assert_equal(
+ numpy.unique(OPS.to_numpy(vocab.vectors.data), axis=0),
+ OPS.to_numpy(vocab.vectors.data),
+ )
+ # duplicate vectors use the same keys now
+ assert (
+ vocab.vectors.key2row[v.strings["a1"]] == vocab.vectors.key2row[v.strings["a2"]]
+ )
+ assert (
+ vocab.vectors.key2row[v.strings["c1"]] == vocab.vectors.key2row[v.strings["c2"]]
+ )
+ # deduplicating again makes no changes
+ vocab_b = vocab.to_bytes()
+ vocab.deduplicate_vectors()
+ assert vocab_b == vocab.to_bytes()
+
+
@pytest.fixture()
def floret_vectors_hashvec_str():
"""The full hashvec table from floret with the settings:
@@ -535,6 +568,10 @@ def test_floret_vectors(floret_vectors_vec_str, floret_vectors_hashvec_str):
# every word has a vector
assert nlp.vocab[word * 5].has_vector
+ # n_keys is -1 for floret
+ assert nlp_plain.vocab.vectors.n_keys > 0
+ assert nlp.vocab.vectors.n_keys == -1
+
# check that single and batched vector lookups are identical
words = [s for s in nlp_plain.vocab.vectors]
single_vecs = OPS.to_numpy(OPS.asarray([nlp.vocab[word].vector for word in words]))
diff --git a/spacy/tokenizer.pxd b/spacy/tokenizer.pxd
index fa38a1015..e6a072053 100644
--- a/spacy/tokenizer.pxd
+++ b/spacy/tokenizer.pxd
@@ -23,9 +23,10 @@ cdef class Tokenizer:
cdef object _infix_finditer
cdef object _rules
cdef PhraseMatcher _special_matcher
- # TODO next two are unused and should be removed in v4
+ # TODO convert to bool in v4
+ cdef int _faster_heuristics
+ # TODO next one is unused and should be removed in v4
# https://github.com/explosion/spaCy/pull/9150
- cdef int _unused_int1
cdef int _unused_int2
cdef Doc _tokenize_affixes(self, str string, bint with_special_cases)
diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx
index 91f228032..0e75b5f7a 100644
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@@ -34,7 +34,7 @@ cdef class Tokenizer:
"""
def __init__(self, Vocab vocab, rules=None, prefix_search=None,
suffix_search=None, infix_finditer=None, token_match=None,
- url_match=None):
+ url_match=None, faster_heuristics=True):
"""Create a `Tokenizer`, to create `Doc` objects given unicode text.
vocab (Vocab): A storage container for lexical types.
@@ -43,7 +43,7 @@ cdef class Tokenizer:
`re.compile(string).search` to match prefixes.
suffix_search (callable): A function matching the signature of
`re.compile(string).search` to match suffixes.
- `infix_finditer` (callable): A function matching the signature of
+ infix_finditer (callable): A function matching the signature of
`re.compile(string).finditer` to find infixes.
token_match (callable): A function matching the signature of
`re.compile(string).match`, for matching strings to be
@@ -51,6 +51,9 @@ cdef class Tokenizer:
url_match (callable): A function matching the signature of
`re.compile(string).match`, for matching strings to be
recognized as urls.
+ faster_heuristics (bool): Whether to restrict the final
+ Matcher-based pass for rules to those containing affixes or space.
+ Defaults to True.
EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab)
@@ -66,6 +69,7 @@ cdef class Tokenizer:
self.suffix_search = suffix_search
self.infix_finditer = infix_finditer
self.vocab = vocab
+ self.faster_heuristics = faster_heuristics
self._rules = {}
self._special_matcher = PhraseMatcher(self.vocab)
self._load_special_cases(rules)
@@ -122,6 +126,14 @@ cdef class Tokenizer:
self._specials = PreshMap()
self._load_special_cases(rules)
+ property faster_heuristics:
+ def __get__(self):
+ return bool(self._faster_heuristics)
+
+ def __set__(self, faster_heuristics):
+ self._faster_heuristics = bool(faster_heuristics)
+ self._reload_special_cases()
+
def __reduce__(self):
args = (self.vocab,
self.rules,
@@ -287,7 +299,7 @@ cdef class Tokenizer:
spans = [doc[match.start:match.end] for match in filtered]
cdef bint modify_in_place = True
cdef int curr_length = doc.length
- cdef int max_length
+ cdef int max_length = 0
cdef int span_length_diff = 0
span_data = {}
for span in spans:
@@ -602,7 +614,7 @@ cdef class Tokenizer:
self.mem.free(stale_special)
self._rules[string] = substrings
self._flush_cache()
- if self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
+ if not self.faster_heuristics or self.find_prefix(string) or self.find_infix(string) or self.find_suffix(string) or " " in string:
self._special_matcher.add(string, None, self._tokenize_affixes(string, False))
def _reload_special_cases(self):
@@ -643,6 +655,10 @@ cdef class Tokenizer:
for substring in text.split():
suffixes = []
while substring:
+ if substring in special_cases:
+ tokens.extend(("SPECIAL-" + str(i + 1), self.vocab.strings[e[ORTH]]) for i, e in enumerate(special_cases[substring]))
+ substring = ''
+ continue
while prefix_search(substring) or suffix_search(substring):
if token_match(substring):
tokens.append(("TOKEN_MATCH", substring))
@@ -773,7 +789,8 @@ cdef class Tokenizer:
"infix_finditer": lambda: _get_regex_pattern(self.infix_finditer),
"token_match": lambda: _get_regex_pattern(self.token_match),
"url_match": lambda: _get_regex_pattern(self.url_match),
- "exceptions": lambda: dict(sorted(self._rules.items()))
+ "exceptions": lambda: dict(sorted(self._rules.items())),
+ "faster_heuristics": lambda: self.faster_heuristics,
}
return util.to_bytes(serializers, exclude)
@@ -794,7 +811,8 @@ cdef class Tokenizer:
"infix_finditer": lambda b: data.setdefault("infix_finditer", b),
"token_match": lambda b: data.setdefault("token_match", b),
"url_match": lambda b: data.setdefault("url_match", b),
- "exceptions": lambda b: data.setdefault("rules", b)
+ "exceptions": lambda b: data.setdefault("rules", b),
+ "faster_heuristics": lambda b: data.setdefault("faster_heuristics", b),
}
# reset all properties and flush all caches (through rules),
# reset rules first so that _reload_special_cases is trivial/fast as
@@ -818,6 +836,8 @@ cdef class Tokenizer:
self.url_match = re.compile(data["url_match"]).match
if "rules" in data and isinstance(data["rules"], dict):
self.rules = data["rules"]
+ if "faster_heuristics" in data:
+ self.faster_heuristics = data["faster_heuristics"]
return self
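As a hedged sketch of the new `faster_heuristics` flag: it can be toggled on an existing tokenizer and it round-trips through tokenizer serialization. The blank English pipeline below is only an assumption for the example; the flag defaults to `True` as in the constructor above.

```python
import spacy

nlp = spacy.blank("en")
assert nlp.tokenizer.faster_heuristics is True  # default per the constructor

# Disable the shortcut so every special-case rule goes through the final
# PhraseMatcher pass, not only rules containing affixes or spaces.
nlp.tokenizer.faster_heuristics = False

# The setting is serialized alongside the other tokenizer settings.
tok_bytes = nlp.tokenizer.to_bytes()
nlp2 = spacy.blank("en")
nlp2.tokenizer.from_bytes(tok_bytes)
assert nlp2.tokenizer.faster_heuristics is False
```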
diff --git a/spacy/tokens/_dict_proxies.py b/spacy/tokens/_dict_proxies.py
index 470d3430f..8643243fa 100644
--- a/spacy/tokens/_dict_proxies.py
+++ b/spacy/tokens/_dict_proxies.py
@@ -6,6 +6,7 @@ import srsly
from .span_group import SpanGroup
from ..errors import Errors
+
if TYPE_CHECKING:
# This lets us add type hints for mypy etc. without causing circular imports
from .doc import Doc # noqa: F401
@@ -19,6 +20,8 @@ if TYPE_CHECKING:
class SpanGroups(UserDict):
"""A dict-like proxy held by the Doc, to control access to span groups."""
+ _EMPTY_BYTES = srsly.msgpack_dumps([])
+
def __init__(
self, doc: "Doc", items: Iterable[Tuple[str, SpanGroup]] = tuple()
) -> None:
@@ -43,11 +46,13 @@ class SpanGroups(UserDict):
def to_bytes(self) -> bytes:
# We don't need to serialize this as a dict, because the groups
# know their names.
+ if len(self) == 0:
+ return self._EMPTY_BYTES
msg = [value.to_bytes() for value in self.values()]
return srsly.msgpack_dumps(msg)
def from_bytes(self, bytes_data: bytes) -> "SpanGroups":
- msg = srsly.msgpack_loads(bytes_data)
+ msg = [] if bytes_data == self._EMPTY_BYTES else srsly.msgpack_loads(bytes_data)
self.clear()
doc = self._ensure_doc()
for value_bytes in msg:
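A minimal sketch of the round trip this change guards: an empty `doc.spans` now serializes to the fixed `_EMPTY_BYTES` payload, which `DocBin` can recognize (see the `_serialize.py` hunk below). Assumes spaCy built from this branch.

```python
import srsly
import spacy

nlp = spacy.blank("en")
doc = nlp("Their goi ng home")

empty_bytes = doc.spans.to_bytes()
assert empty_bytes == srsly.msgpack_dumps([])  # the _EMPTY_BYTES sentinel

doc.spans.from_bytes(empty_bytes)  # deserializes back to an empty container
assert len(doc.spans) == 0
```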
diff --git a/spacy/tokens/_serialize.py b/spacy/tokens/_serialize.py
index bd2bdb811..c4e8f26f4 100644
--- a/spacy/tokens/_serialize.py
+++ b/spacy/tokens/_serialize.py
@@ -12,6 +12,7 @@ from ..compat import copy_reg
from ..attrs import SPACY, ORTH, intify_attr, IDS
from ..errors import Errors
from ..util import ensure_path, SimpleFrozenList
+from ._dict_proxies import SpanGroups
# fmt: off
ALL_ATTRS = ("ORTH", "NORM", "TAG", "HEAD", "DEP", "ENT_IOB", "ENT_TYPE", "ENT_KB_ID", "ENT_ID", "LEMMA", "MORPH", "POS", "SENT_START")
@@ -146,7 +147,8 @@ class DocBin:
doc = Doc(vocab, words=tokens[:, orth_col], spaces=spaces) # type: ignore
doc = doc.from_array(self.attrs, tokens) # type: ignore
doc.cats = self.cats[i]
- if self.span_groups[i]:
+ # backwards-compatibility: may be b'' or serialized empty list
+ if self.span_groups[i] and self.span_groups[i] != SpanGroups._EMPTY_BYTES:
doc.spans.from_bytes(self.span_groups[i])
else:
doc.spans.clear()
diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx
index 5a0db115d..c36e3a02f 100644
--- a/spacy/tokens/doc.pyx
+++ b/spacy/tokens/doc.pyx
@@ -11,7 +11,7 @@ from enum import Enum
import itertools
import numpy
import srsly
-from thinc.api import get_array_module
+from thinc.api import get_array_module, get_current_ops
from thinc.util import copy_array
import warnings
@@ -420,6 +420,8 @@ cdef class Doc:
cdef int range_start = 0
if attr == "IS_SENT_START" or attr == self.vocab.strings["IS_SENT_START"]:
attr = SENT_START
+ elif attr == "IS_SENT_END" or attr == self.vocab.strings["IS_SENT_END"]:
+ attr = SENT_START
attr = intify_attr(attr)
# adjust attributes
if attr == HEAD:
@@ -1106,14 +1108,19 @@ cdef class Doc:
return self
@staticmethod
- def from_docs(docs, ensure_whitespace=True, attrs=None):
+ def from_docs(docs, ensure_whitespace=True, attrs=None, *, exclude=tuple()):
"""Concatenate multiple Doc objects to form a new one. Raises an error
if the `Doc` objects do not all share the same `Vocab`.
docs (list): A list of Doc objects.
- ensure_whitespace (bool): Insert a space between two adjacent docs whenever the first doc does not end in whitespace.
- attrs (list): Optional list of attribute ID ints or attribute name strings.
- RETURNS (Doc): A doc that contains the concatenated docs, or None if no docs were given.
+ ensure_whitespace (bool): Insert a space between two adjacent docs
+ whenever the first doc does not end in whitespace.
+ attrs (list): Optional list of attribute ID ints or attribute name
+ strings.
+ exclude (Iterable[str]): Doc attributes to exclude. Supported
+ attributes: `spans`, `tensor`, `user_data`.
+ RETURNS (Doc): A doc that contains the concatenated docs, or None if no
+ docs were given.
DOCS: https://spacy.io/api/doc#from_docs
"""
@@ -1143,31 +1150,33 @@ cdef class Doc:
concat_words.extend(t.text for t in doc)
concat_spaces.extend(bool(t.whitespace_) for t in doc)
- for key, value in doc.user_data.items():
- if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
- data_type, name, start, end = key
- if start is not None or end is not None:
- start += char_offset
- if end is not None:
- end += char_offset
- concat_user_data[(data_type, name, start, end)] = copy.copy(value)
+ if "user_data" not in exclude:
+ for key, value in doc.user_data.items():
+ if isinstance(key, tuple) and len(key) == 4 and key[0] == "._.":
+ data_type, name, start, end = key
+ if start is not None or end is not None:
+ start += char_offset
+ if end is not None:
+ end += char_offset
+ concat_user_data[(data_type, name, start, end)] = copy.copy(value)
+ else:
+ warnings.warn(Warnings.W101.format(name=name))
else:
- warnings.warn(Warnings.W101.format(name=name))
- else:
- warnings.warn(Warnings.W102.format(key=key, value=value))
- for key in doc.spans:
- # if a spans key is in any doc, include it in the merged doc
- # even if it is empty
- if key not in concat_spans:
- concat_spans[key] = []
- for span in doc.spans[key]:
- concat_spans[key].append((
- span.start_char + char_offset,
- span.end_char + char_offset,
- span.label,
- span.kb_id,
- span.text, # included as a check
- ))
+ warnings.warn(Warnings.W102.format(key=key, value=value))
+ if "spans" not in exclude:
+ for key in doc.spans:
+ # if a spans key is in any doc, include it in the merged doc
+ # even if it is empty
+ if key not in concat_spans:
+ concat_spans[key] = []
+ for span in doc.spans[key]:
+ concat_spans[key].append((
+ span.start_char + char_offset,
+ span.end_char + char_offset,
+ span.label,
+ span.kb_id,
+ span.text, # included as a check
+ ))
char_offset += len(doc.text)
if len(doc) > 0 and ensure_whitespace and not doc[-1].is_space and not bool(doc[-1].whitespace_):
char_offset += 1
@@ -1208,6 +1217,10 @@ cdef class Doc:
else:
raise ValueError(Errors.E873.format(key=key, text=text))
+ if "tensor" not in exclude and any(len(doc) for doc in docs):
+ ops = get_current_ops()
+ concat_doc.tensor = ops.xp.vstack([ops.asarray(doc.tensor) for doc in docs if len(doc)])
+
return concat_doc
def get_lca_matrix(self):
@@ -1455,7 +1468,7 @@ cdef class Doc:
underscore (list): Optional list of string names of custom doc._.
attributes. Attribute values need to be JSON-serializable. Values will
be added to an "_" key in the data, e.g. "_": {"foo": "bar"}.
- RETURNS (dict): The data in spaCy's JSON format.
+ RETURNS (dict): The data in JSON format.
"""
data = {"text": self.text}
if self.has_annotation("ENT_IOB"):
@@ -1484,6 +1497,15 @@ cdef class Doc:
token_data["dep"] = token.dep_
token_data["head"] = token.head.i
data["tokens"].append(token_data)
+
+ if self.spans:
+ data["spans"] = {}
+ for span_group in self.spans:
+ data["spans"][span_group] = []
+ for span in self.spans[span_group]:
+ span_data = {"start": span.start_char, "end": span.end_char, "label": span.label_, "kb_id": span.kb_id_}
+ data["spans"][span_group].append(span_data)
+
if underscore:
data["_"] = {}
for attr in underscore:
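A short sketch of the two user-facing `Doc` additions in this hunk: the `exclude` argument of `Doc.from_docs` and span groups in `Doc.to_json()`. The blank English pipeline and the span key `"errors"` are illustrative assumptions.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc1 = nlp("Their goi ng home")
doc1.spans["errors"] = [doc1[0:1], doc1[1:3]]
doc2 = nlp("They are going home")

# By default span groups are carried over; exclude=["spans"] drops them.
merged = Doc.from_docs([doc1, doc2])
assert "errors" in merged.spans
merged_no_spans = Doc.from_docs([doc1, doc2], exclude=["spans"])
assert "errors" not in merged_no_spans.spans

# Span groups now also show up in the JSON output.
assert "spans" in doc1.to_json()
```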
diff --git a/spacy/tokens/graph.pyx b/spacy/tokens/graph.pyx
index 9351435f8..adc4d23c8 100644
--- a/spacy/tokens/graph.pyx
+++ b/spacy/tokens/graph.pyx
@@ -9,6 +9,8 @@ cimport cython
import weakref
from preshed.maps cimport map_get_unless_missing
from murmurhash.mrmr cimport hash64
+
+from .. import Errors
from ..typedefs cimport hash_t
from ..strings import get_string_id
from ..structs cimport EdgeC, GraphC
@@ -68,7 +70,7 @@ cdef class Node:
"""
cdef int length = graph.c.nodes.size()
if i >= length or -i >= length:
- raise IndexError(f"Node index {i} out of bounds ({length})")
+ raise IndexError(Errors.E1034.format(i=i, length=length))
if i < 0:
i += length
self.graph = graph
@@ -88,7 +90,7 @@ cdef class Node:
"""Get a token index from the node's set of tokens."""
length = self.graph.c.nodes[self.i].size()
if i >= length or -i >= length:
- raise IndexError(f"Token index {i} out of bounds ({length})")
+ raise IndexError(Errors.E1035.format(i=i, length=length))
if i < 0:
i += length
return self.graph.c.nodes[self.i][i]
@@ -306,7 +308,7 @@ cdef class NoneNode(Node):
self.i = -1
def __getitem__(self, int i):
- raise IndexError("Cannot index into NoneNode.")
+ raise IndexError(Errors.E1036)
def __len__(self):
return 0
@@ -484,7 +486,6 @@ cdef class Graph:
for idx in indices:
node.push_back(idx)
i = add_node(&self.c, node)
- print("Add node", indices, i)
return Node(self, i)
def get_node(self, indices) -> Node:
@@ -501,7 +502,6 @@ cdef class Graph:
if node_index < 0:
return NoneNode(self)
else:
- print("Get node", indices, node_index)
return Node(self, node_index)
def has_node(self, tuple indices) -> bool:
@@ -661,8 +661,6 @@ cdef int walk_head_nodes(vector[int]& output, const GraphC* graph, int node) nog
seen.insert(node)
i = 0
while i < output.size():
- with gil:
- print("Walk up from", output[i])
if seen.find(output[i]) == seen.end():
seen.insert(output[i])
get_head_nodes(output, graph, output[i])
diff --git a/spacy/tokens/span.pyx b/spacy/tokens/span.pyx
index 4b0c724e5..305d7caf4 100644
--- a/spacy/tokens/span.pyx
+++ b/spacy/tokens/span.pyx
@@ -730,7 +730,7 @@ cdef class Span:
def __set__(self, int start):
if start < 0:
- raise IndexError("TODO")
+ raise IndexError(Errors.E1032.format(var="start", forbidden="< 0", value=start))
self.c.start = start
property end:
@@ -739,7 +739,7 @@ cdef class Span:
def __set__(self, int end):
if end < 0:
- raise IndexError("TODO")
+ raise IndexError(Errors.E1032.format(var="end", forbidden="< 0", value=end))
self.c.end = end
property start_char:
@@ -748,7 +748,7 @@ cdef class Span:
def __set__(self, int start_char):
if start_char < 0:
- raise IndexError("TODO")
+ raise IndexError(Errors.E1032.format(var="start_char", forbidden="< 0", value=start_char))
self.c.start_char = start_char
property end_char:
@@ -757,7 +757,7 @@ cdef class Span:
def __set__(self, int end_char):
if end_char < 0:
- raise IndexError("TODO")
+ raise IndexError(Errors.E1032.format(var="end_char", forbidden="< 0", value=end_char))
self.c.end_char = end_char
property label:
diff --git a/spacy/tokens/span_group.pyx b/spacy/tokens/span_group.pyx
index 6cfa75237..bb0fab24f 100644
--- a/spacy/tokens/span_group.pyx
+++ b/spacy/tokens/span_group.pyx
@@ -1,10 +1,11 @@
+from typing import Iterable, Tuple, Union, Optional, TYPE_CHECKING
import weakref
import struct
+from copy import deepcopy
import srsly
from spacy.errors import Errors
from .span cimport Span
-from libc.stdint cimport uint64_t, uint32_t, int32_t
cdef class SpanGroup:
@@ -20,13 +21,13 @@ cdef class SpanGroup:
>>> doc.spans["errors"] = SpanGroup(
doc,
name="errors",
- spans=[doc[0:1], doc[2:4]],
+ spans=[doc[0:1], doc[1:3]],
attrs={"annotator": "matt"}
)
Construction 2
>>> doc = nlp("Their goi ng home")
- >>> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+ >>> doc.spans["errors"] = [doc[0:1], doc[1:3]]
>>> assert isinstance(doc.spans["errors"], SpanGroup)
DOCS: https://spacy.io/api/spangroup
@@ -48,6 +49,8 @@ cdef class SpanGroup:
self.name = name
self.attrs = dict(attrs) if attrs is not None else {}
cdef Span span
+        if len(spans):
+ self.c.reserve(len(spans))
for span in spans:
self.push_back(span.c)
@@ -89,6 +92,72 @@ cdef class SpanGroup:
"""
return self.c.size()
+ def __getitem__(self, int i) -> Span:
+ """Get a span from the group. Note that a copy of the span is returned,
+ so if any changes are made to this span, they are not reflected in the
+ corresponding member of the span group.
+
+ i (int): The item index.
+ RETURNS (Span): The span at the given index.
+
+ DOCS: https://spacy.io/api/spangroup#getitem
+ """
+ i = self._normalize_index(i)
+ return Span.cinit(self.doc, self.c[i])
+
+ def __delitem__(self, int i):
+ """Delete a span from the span group at index i.
+
+ i (int): The item index.
+
+ DOCS: https://spacy.io/api/spangroup#delitem
+ """
+ i = self._normalize_index(i)
+        self.c.erase(self.c.begin() + i)
+
+ def __setitem__(self, int i, Span span):
+ """Set a span in the span group.
+
+ i (int): The item index.
+ span (Span): The span.
+
+ DOCS: https://spacy.io/api/spangroup#setitem
+ """
+ if span.doc is not self.doc:
+ raise ValueError(Errors.E855.format(obj="span"))
+
+ i = self._normalize_index(i)
+ self.c[i] = span.c
+
+ def __iadd__(self, other: Union[SpanGroup, Iterable["Span"]]) -> SpanGroup:
+ """Operator +=. Append a span group or spans to this group and return
+ the current span group.
+
+ other (Union[SpanGroup, Iterable["Span"]]): The SpanGroup or spans to
+ add.
+
+ RETURNS (SpanGroup): The current span group.
+
+ DOCS: https://spacy.io/api/spangroup#iadd
+ """
+ return self._concat(other, inplace=True)
+
+ def __add__(self, other: SpanGroup) -> SpanGroup:
+ """Operator +. Concatenate a span group with this group and return a
+ new span group.
+
+ other (SpanGroup): The SpanGroup to add.
+
+ RETURNS (SpanGroup): The concatenated SpanGroup.
+
+ DOCS: https://spacy.io/api/spangroup#add
+ """
+ # For Cython 0.x and __add__, you cannot rely on `self` as being `self`
+ # or being the right type, so both types need to be checked explicitly.
+ if isinstance(self, SpanGroup) and isinstance(other, SpanGroup):
+ return self._concat(other)
+ return NotImplemented
+
def append(self, Span span):
"""Add a span to the group. The span must refer to the same Doc
object as the span group.
@@ -98,35 +167,18 @@ cdef class SpanGroup:
DOCS: https://spacy.io/api/spangroup#append
"""
if span.doc is not self.doc:
- raise ValueError("Cannot add span to group: refers to different Doc.")
+ raise ValueError(Errors.E855.format(obj="span"))
self.push_back(span.c)
- def extend(self, spans):
- """Add multiple spans to the group. All spans must refer to the same
- Doc object as the span group.
+ def extend(self, spans_or_span_group: Union[SpanGroup, Iterable["Span"]]):
+ """Add multiple spans or contents of another SpanGroup to the group.
+ All spans must refer to the same Doc object as the span group.
- spans (Iterable[Span]): The spans to add.
+ spans (Union[SpanGroup, Iterable["Span"]]): The spans to add.
DOCS: https://spacy.io/api/spangroup#extend
"""
- cdef Span span
- for span in spans:
- self.append(span)
-
- def __getitem__(self, int i):
- """Get a span from the group.
-
- i (int): The item index.
- RETURNS (Span): The span at the given index.
-
- DOCS: https://spacy.io/api/spangroup#getitem
- """
- cdef int size = self.c.size()
- if i < -size or i >= size:
- raise IndexError(f"list index {i} out of range")
- if i < 0:
- i += size
- return Span.cinit(self.doc, self.c[i])
+ self._concat(spans_or_span_group, inplace=True)
def to_bytes(self):
"""Serialize the SpanGroup's contents to a byte string.
@@ -136,6 +188,7 @@ cdef class SpanGroup:
DOCS: https://spacy.io/api/spangroup#to_bytes
"""
output = {"name": self.name, "attrs": self.attrs, "spans": []}
+ cdef int i
for i in range(self.c.size()):
span = self.c[i]
# The struct.pack here is probably overkill, but it might help if
@@ -187,3 +240,74 @@ cdef class SpanGroup:
cdef void push_back(self, SpanC span) nogil:
self.c.push_back(span)
+
+ def copy(self) -> SpanGroup:
+ """Clones the span group.
+
+ RETURNS (SpanGroup): A copy of the span group.
+
+ DOCS: https://spacy.io/api/spangroup#copy
+ """
+ return SpanGroup(
+ self.doc,
+ name=self.name,
+ attrs=deepcopy(self.attrs),
+ spans=list(self),
+ )
+
+ def _concat(
+ self,
+ other: Union[SpanGroup, Iterable["Span"]],
+ *,
+ inplace: bool = False,
+ ) -> SpanGroup:
+ """Concatenates the current span group with the provided span group or
+ spans, either in place or creating a copy. Preserves the name of self,
+ updates attrs only with values that are not in self.
+
+ other (Union[SpanGroup, Iterable[Span]]): The spans to append.
+ inplace (bool): Indicates whether the operation should be performed in
+ place on the current span group.
+
+ RETURNS (SpanGroup): Either a new SpanGroup or the current SpanGroup
+ depending on the value of inplace.
+ """
+ cdef SpanGroup span_group = self if inplace else self.copy()
+ cdef SpanGroup other_group
+ cdef Span span
+
+ if isinstance(other, SpanGroup):
+ other_group = other
+ if other_group.doc is not self.doc:
+ raise ValueError(Errors.E855.format(obj="span group"))
+
+ other_attrs = deepcopy(other_group.attrs)
+ span_group.attrs.update({
+ key: value for key, value in other_attrs.items() \
+ if key not in span_group.attrs
+ })
+ if len(other_group):
+ span_group.c.reserve(span_group.c.size() + other_group.c.size())
+ span_group.c.insert(span_group.c.end(), other_group.c.begin(), other_group.c.end())
+ else:
+ if len(other):
+ span_group.c.reserve(self.c.size() + len(other))
+ for span in other:
+ if span.doc is not self.doc:
+ raise ValueError(Errors.E855.format(obj="span"))
+ span_group.c.push_back(span.c)
+
+ return span_group
+
+ def _normalize_index(self, int i) -> int:
+ """Checks list index boundaries and adjusts the index if negative.
+
+ i (int): The index.
+ RETURNS (int): The adjusted index.
+ """
+ cdef int length = self.c.size()
+ if i < -length or i >= length:
+ raise IndexError(Errors.E856.format(i=i, length=length))
+ if i < 0:
+ i += length
+ return i
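To summarize the new `SpanGroup` surface added above (item assignment, deletion, copying and concatenation), here is a hedged sketch; the span group name `"errors"` and the blank pipeline are assumptions for the example.

```python
import spacy
from spacy.tokens import SpanGroup

nlp = spacy.blank("en")
doc = nlp("Their goi ng home")
doc.spans["errors"] = [doc[0:1], doc[1:3]]
group = doc.spans["errors"]

group[1] = doc[2:4]  # __setitem__ replaces the second span
del group[0]         # __delitem__ removes the first span
assert len(group) == 1

other = SpanGroup(doc, name="more", spans=[doc[3:4]])
combined = group + other  # new group; keeps group's name and attrs
group += [doc[0:1]]       # in-place extension with an iterable of spans
assert len(combined) == 2 and len(group) == 2
```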
diff --git a/spacy/tokens/token.pyx b/spacy/tokens/token.pyx
index b515ab67b..d14930348 100644
--- a/spacy/tokens/token.pyx
+++ b/spacy/tokens/token.pyx
@@ -487,8 +487,6 @@ cdef class Token:
RETURNS (bool / None): Whether the token starts a sentence.
None if unknown.
-
- DOCS: https://spacy.io/api/token#is_sent_start
"""
def __get__(self):
if self.c.sent_start == 0:
diff --git a/spacy/training/alignment.py b/spacy/training/alignment.py
index 3e3b60ca6..6d24714bf 100644
--- a/spacy/training/alignment.py
+++ b/spacy/training/alignment.py
@@ -1,31 +1,22 @@
from typing import List
-import numpy
-from thinc.types import Ragged
from dataclasses import dataclass
from .align import get_alignments
+from .alignment_array import AlignmentArray
@dataclass
class Alignment:
- x2y: Ragged
- y2x: Ragged
+ x2y: AlignmentArray
+ y2x: AlignmentArray
@classmethod
def from_indices(cls, x2y: List[List[int]], y2x: List[List[int]]) -> "Alignment":
- x2y = _make_ragged(x2y)
- y2x = _make_ragged(y2x)
+ x2y = AlignmentArray(x2y)
+ y2x = AlignmentArray(y2x)
return Alignment(x2y=x2y, y2x=y2x)
@classmethod
def from_strings(cls, A: List[str], B: List[str]) -> "Alignment":
x2y, y2x = get_alignments(A, B)
return Alignment.from_indices(x2y=x2y, y2x=y2x)
-
-
-def _make_ragged(indices):
- lengths = numpy.array([len(x) for x in indices], dtype="i")
- flat = []
- for x in indices:
- flat.extend(x)
- return Ragged(numpy.array(flat, dtype="i"), lengths)
diff --git a/spacy/training/alignment_array.pxd b/spacy/training/alignment_array.pxd
new file mode 100644
index 000000000..056f5bef3
--- /dev/null
+++ b/spacy/training/alignment_array.pxd
@@ -0,0 +1,7 @@
+from libcpp.vector cimport vector
+cimport numpy as np
+
+cdef class AlignmentArray:
+ cdef np.ndarray _data
+ cdef np.ndarray _lengths
+ cdef np.ndarray _starts_ends
diff --git a/spacy/training/alignment_array.pyx b/spacy/training/alignment_array.pyx
new file mode 100644
index 000000000..b58f08786
--- /dev/null
+++ b/spacy/training/alignment_array.pyx
@@ -0,0 +1,68 @@
+from typing import List
+from ..errors import Errors
+import numpy
+
+
+cdef class AlignmentArray:
+    """AlignmentArray is similar to Thinc's Ragged with two simplifications:
+    indexing returns numpy arrays and this type can only be used for CPU arrays.
+    However, these changes make AlignmentArray more efficient for indexing in a
+ tight loop."""
+
+ __slots__ = []
+
+ def __init__(self, alignment: List[List[int]]):
+ self._lengths = None
+ self._starts_ends = numpy.zeros(len(alignment) + 1, dtype="i")
+
+ cdef int data_len = 0
+ cdef int outer_len
+ cdef int idx
+ for idx, outer in enumerate(alignment):
+ outer_len = len(outer)
+ self._starts_ends[idx + 1] = self._starts_ends[idx] + outer_len
+ data_len += outer_len
+
+ self._data = numpy.empty(data_len, dtype="i")
+ idx = 0
+ for outer in alignment:
+ for inner in outer:
+ self._data[idx] = inner
+ idx += 1
+
+ def __getitem__(self, idx):
+ starts = self._starts_ends[:-1]
+ ends = self._starts_ends[1:]
+ if isinstance(idx, int):
+ start = starts[idx]
+ end = ends[idx]
+ elif isinstance(idx, slice):
+ if not (idx.step is None or idx.step == 1):
+ raise ValueError(Errors.E1027)
+ start = starts[idx]
+ if len(start) == 0:
+ return self._data[0:0]
+ start = start[0]
+ end = ends[idx][-1]
+ else:
+ raise ValueError(Errors.E1028)
+
+ return self._data[start:end]
+
+ @property
+ def data(self):
+ return self._data
+
+ @property
+ def lengths(self):
+ if self._lengths is None:
+ self._lengths = self.ends - self.starts
+ return self._lengths
+
+ @property
+ def ends(self):
+ return self._starts_ends[1:]
+
+ @property
+ def starts(self):
+ return self._starts_ends[:-1]
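The behaviour the docstring describes, as a small sketch: integer and contiguous-slice indexing both return flat numpy arrays of aligned indices. Assumes spaCy built from this branch so the compiled module is available.

```python
from spacy.training.alignment_array import AlignmentArray

align = AlignmentArray([[0], [1, 2], [], [3]])
assert list(align[1]) == [1, 2]        # indices aligned to position 1
assert list(align[0:2]) == [0, 1, 2]   # contiguous slice, flattened
assert list(align.lengths) == [1, 2, 0, 1]
assert list(align.data) == [0, 1, 2, 3]
```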
diff --git a/spacy/training/augment.py b/spacy/training/augment.py
index 63b54034c..59a39c7ee 100644
--- a/spacy/training/augment.py
+++ b/spacy/training/augment.py
@@ -1,4 +1,5 @@
from typing import Callable, Iterator, Dict, List, Tuple, TYPE_CHECKING
+from typing import Optional
import random
import itertools
from functools import partial
@@ -11,32 +12,87 @@ if TYPE_CHECKING:
from ..language import Language # noqa: F401
-class OrthVariantsSingle(BaseModel):
- tags: List[StrictStr]
- variants: List[StrictStr]
+@registry.augmenters("spacy.combined_augmenter.v1")
+def create_combined_augmenter(
+ lower_level: float,
+ orth_level: float,
+ orth_variants: Optional[Dict[str, List[Dict]]],
+ whitespace_level: float,
+ whitespace_per_token: float,
+ whitespace_variants: Optional[List[str]],
+) -> Callable[["Language", Example], Iterator[Example]]:
+    """Create a data augmentation callback that combines lowercasing, orth-variant
+    replacement and whitespace-token insertion. The callback can be added to a
+    corpus or other data iterator during training.
+
+ lower_level (float): The percentage of texts that will be lowercased.
+ orth_level (float): The percentage of texts that will be augmented.
+ orth_variants (Optional[Dict[str, List[Dict]]]): A dictionary containing the
+ single and paired orth variants. Typically loaded from a JSON file.
+ whitespace_level (float): The percentage of texts that will have whitespace
+ tokens inserted.
+ whitespace_per_token (float): The number of whitespace tokens to insert in
+ the modified doc as a percentage of the doc length.
+ whitespace_variants (Optional[List[str]]): The whitespace token texts.
+ RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
+ """
+ return partial(
+ combined_augmenter,
+ lower_level=lower_level,
+ orth_level=orth_level,
+ orth_variants=orth_variants,
+ whitespace_level=whitespace_level,
+ whitespace_per_token=whitespace_per_token,
+ whitespace_variants=whitespace_variants,
+ )
-class OrthVariantsPaired(BaseModel):
- tags: List[StrictStr]
- variants: List[List[StrictStr]]
-
-
-class OrthVariants(BaseModel):
- paired: List[OrthVariantsPaired] = []
- single: List[OrthVariantsSingle] = []
+def combined_augmenter(
+ nlp: "Language",
+ example: Example,
+ *,
+ lower_level: float = 0.0,
+ orth_level: float = 0.0,
+ orth_variants: Optional[Dict[str, List[Dict]]] = None,
+ whitespace_level: float = 0.0,
+ whitespace_per_token: float = 0.0,
+ whitespace_variants: Optional[List[str]] = None,
+) -> Iterator[Example]:
+ if random.random() < lower_level:
+ example = make_lowercase_variant(nlp, example)
+ if orth_variants and random.random() < orth_level:
+ raw_text = example.text
+ orig_dict = example.to_dict()
+ variant_text, variant_token_annot = make_orth_variants(
+ nlp,
+ raw_text,
+ orig_dict["token_annotation"],
+ orth_variants,
+ lower=False,
+ )
+ orig_dict["token_annotation"] = variant_token_annot
+ example = example.from_dict(nlp.make_doc(variant_text), orig_dict)
+ if whitespace_variants and random.random() < whitespace_level:
+ for _ in range(int(len(example.reference) * whitespace_per_token)):
+ example = make_whitespace_variant(
+ nlp,
+ example,
+ random.choice(whitespace_variants),
+ random.randrange(0, len(example.reference)),
+ )
+ yield example
@registry.augmenters("spacy.orth_variants.v1")
def create_orth_variants_augmenter(
- level: float, lower: float, orth_variants: OrthVariants
+ level: float, lower: float, orth_variants: Dict[str, List[Dict]]
) -> Callable[["Language", Example], Iterator[Example]]:
"""Create a data augmentation callback that uses orth-variant replacement.
The callback can be added to a corpus or other data iterator during training.
level (float): The percentage of texts that will be augmented.
lower (float): The percentage of texts that will be lowercased.
- orth_variants (Dict[str, dict]): A dictionary containing the single and
- paired orth variants. Typically loaded from a JSON file.
+ orth_variants (Dict[str, List[Dict]]): A dictionary containing
+ the single and paired orth variants. Typically loaded from a JSON file.
RETURNS (Callable[[Language, Example], Iterator[Example]]): The augmenter.
"""
return partial(
@@ -67,16 +123,20 @@ def lower_casing_augmenter(
if random.random() >= level:
yield example
else:
- example_dict = example.to_dict()
- doc = nlp.make_doc(example.text.lower())
- example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
- yield example.from_dict(doc, example_dict)
+ yield make_lowercase_variant(nlp, example)
+
+
+def make_lowercase_variant(nlp: "Language", example: Example):
+ example_dict = example.to_dict()
+ doc = nlp.make_doc(example.text.lower())
+ example_dict["token_annotation"]["ORTH"] = [t.lower_ for t in example.reference]
+ return example.from_dict(doc, example_dict)
def orth_variants_augmenter(
nlp: "Language",
example: Example,
- orth_variants: Dict,
+ orth_variants: Dict[str, List[Dict]],
*,
level: float = 0.0,
lower: float = 0.0,
@@ -148,10 +208,132 @@ def make_orth_variants(
pair_idx = pair.index(words[word_idx])
words[word_idx] = punct_choices[punct_idx][pair_idx]
token_dict["ORTH"] = words
- # construct modified raw text from words and spaces
+ raw = construct_modified_raw_text(token_dict)
+ return raw, token_dict
+
+
+def make_whitespace_variant(
+ nlp: "Language",
+ example: Example,
+ whitespace: str,
+ position: int,
+) -> Example:
+ """Insert the whitespace token at the specified token offset in the doc.
+ This is primarily intended for v2-compatible training data that doesn't
+ include links or spans. If the document includes links, spans, or partial
+ dependency annotation, it is returned without modifications.
+
+ The augmentation follows the basics of the v2 space attachment policy, but
+ without a distinction between "real" and other tokens, so space tokens
+ may be attached to space tokens:
+ - at the beginning of a sentence attach the space token to the following
+ token
+ - otherwise attach the space token to the preceding token
+
+ The augmenter does not attempt to consolidate adjacent whitespace in the
+ same way that the tokenizer would.
+
+ The following annotation is used for the space token:
+ TAG: "_SP"
+ MORPH: ""
+ POS: "SPACE"
+ LEMMA: ORTH
+ DEP: "dep"
+ SENT_START: False
+
+ The annotation for each attribute is only set for the space token if there
+ is already at least partial annotation for that attribute in the original
+ example.
+
+ RETURNS (Example): Example with one additional space token.
+ """
+ example_dict = example.to_dict()
+ doc_dict = example_dict.get("doc_annotation", {})
+ token_dict = example_dict.get("token_annotation", {})
+ # returned unmodified if:
+ # - doc is empty
+ # - words are not defined
+ # - links are defined (only character-based offsets, which is more a quirk
+ # of Example.to_dict than a technical constraint)
+ # - spans are defined
+ # - there are partial dependencies
+ if (
+ len(example.reference) == 0
+ or "ORTH" not in token_dict
+ or len(doc_dict.get("links", [])) > 0
+ or len(example.reference.spans) > 0
+ or (
+ example.reference.has_annotation("DEP")
+ and not example.reference.has_annotation("DEP", require_complete=True)
+ )
+ ):
+ return example
+ words = token_dict.get("ORTH", [])
+ length = len(words)
+ assert 0 <= position <= length
+ if example.reference.has_annotation("ENT_TYPE"):
+ # I-ENTITY if between B/I-ENTITY and I/L-ENTITY otherwise O
+ entity = "O"
+ if position > 1 and position < length:
+ ent_prev = doc_dict["entities"][position - 1]
+ ent_next = doc_dict["entities"][position]
+ if "-" in ent_prev and "-" in ent_next:
+ ent_iob_prev = ent_prev.split("-")[0]
+ ent_type_prev = ent_prev.split("-", 1)[1]
+ ent_iob_next = ent_next.split("-")[0]
+ ent_type_next = ent_next.split("-", 1)[1]
+ if (
+ ent_iob_prev in ("B", "I")
+ and ent_iob_next in ("I", "L")
+ and ent_type_prev == ent_type_next
+ ):
+ entity = f"I-{ent_type_prev}"
+ doc_dict["entities"].insert(position, entity)
+ else:
+ del doc_dict["entities"]
+ token_dict["ORTH"].insert(position, whitespace)
+ token_dict["SPACY"].insert(position, False)
+ if example.reference.has_annotation("TAG"):
+ token_dict["TAG"].insert(position, "_SP")
+ else:
+ del token_dict["TAG"]
+ if example.reference.has_annotation("LEMMA"):
+ token_dict["LEMMA"].insert(position, whitespace)
+ else:
+ del token_dict["LEMMA"]
+ if example.reference.has_annotation("POS"):
+ token_dict["POS"].insert(position, "SPACE")
+ else:
+ del token_dict["POS"]
+ if example.reference.has_annotation("MORPH"):
+ token_dict["MORPH"].insert(position, "")
+ else:
+ del token_dict["MORPH"]
+ if example.reference.has_annotation("DEP", require_complete=True):
+ if position == 0:
+ token_dict["HEAD"].insert(position, 0)
+ else:
+ token_dict["HEAD"].insert(position, position - 1)
+ for i in range(len(token_dict["HEAD"])):
+ if token_dict["HEAD"][i] >= position:
+ token_dict["HEAD"][i] += 1
+ token_dict["DEP"].insert(position, "dep")
+ else:
+ del token_dict["HEAD"]
+ del token_dict["DEP"]
+ if example.reference.has_annotation("SENT_START"):
+ token_dict["SENT_START"].insert(position, False)
+ else:
+ del token_dict["SENT_START"]
+ raw = construct_modified_raw_text(token_dict)
+ return Example.from_dict(nlp.make_doc(raw), example_dict)
+
+
+def construct_modified_raw_text(token_dict):
+ """Construct modified raw text from words and spaces."""
raw = ""
for orth, spacy in zip(token_dict["ORTH"], token_dict["SPACY"]):
raw += orth
if spacy:
raw += " "
- return raw, token_dict
+ return raw
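A hedged sketch of wiring the new combined augmenter up by hand via the registry and applying it to a single `Example`. The rates and whitespace variants below are illustrative values, not defaults taken from this diff.

```python
import spacy
from spacy.training import Example
from spacy.util import registry

make_augmenter = registry.augmenters.get("spacy.combined_augmenter.v1")
augmenter = make_augmenter(
    lower_level=0.2,
    orth_level=0.0,
    orth_variants=None,
    whitespace_level=0.5,
    whitespace_per_token=0.2,
    whitespace_variants=[" ", "\t"],
)

nlp = spacy.blank("en")
doc = nlp.make_doc("They are going home")
example = Example.from_dict(doc, {"words": [t.text for t in doc]})

# The augmenter is a callable taking (nlp, example) and yielding examples.
augmented = list(augmenter(nlp, example))
```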
diff --git a/spacy/training/converters/conllu_to_docs.py b/spacy/training/converters/conllu_to_docs.py
index 7a4f44d3b..7052504cc 100644
--- a/spacy/training/converters/conllu_to_docs.py
+++ b/spacy/training/converters/conllu_to_docs.py
@@ -71,6 +71,7 @@ def read_conllx(
):
"""Yield docs, one for each sentence"""
vocab = Vocab() # need vocab to make a minimal Doc
+ set_ents = has_ner(input_data, ner_tag_pattern)
for sent in input_data.strip().split("\n\n"):
lines = sent.strip().split("\n")
if lines:
@@ -83,6 +84,7 @@ def read_conllx(
merge_subtokens=merge_subtokens,
append_morphology=append_morphology,
ner_map=ner_map,
+ set_ents=set_ents,
)
yield doc
@@ -133,6 +135,7 @@ def conllu_sentence_to_doc(
merge_subtokens=False,
append_morphology=False,
ner_map=None,
+ set_ents=False,
):
"""Create an Example from the lines for one CoNLL-U sentence, merging
subtokens and appending morphology to tags if required.
@@ -214,8 +217,10 @@ def conllu_sentence_to_doc(
doc[i]._.merged_morph = morphs[i]
doc[i]._.merged_lemma = lemmas[i]
doc[i]._.merged_spaceafter = spaces[i]
- ents = get_entities(lines, ner_tag_pattern, ner_map)
- doc.ents = biluo_tags_to_spans(doc, ents)
+ ents = None
+ if set_ents:
+ ents = get_entities(lines, ner_tag_pattern, ner_map)
+ doc.ents = biluo_tags_to_spans(doc, ents)
if merge_subtokens:
doc = merge_conllu_subtokens(lines, doc)
@@ -247,7 +252,10 @@ def conllu_sentence_to_doc(
deps=deps,
heads=heads,
)
- doc_x.ents = [Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents]
+ if set_ents:
+ doc_x.ents = [
+ Span(doc_x, ent.start, ent.end, label=ent.label) for ent in doc.ents
+ ]
return doc_x
diff --git a/spacy/training/example.pyx b/spacy/training/example.pyx
index d792c9bbf..ab92f78c6 100644
--- a/spacy/training/example.pyx
+++ b/spacy/training/example.pyx
@@ -159,7 +159,7 @@ cdef class Example:
gold_values = self.reference.to_array([field])
output = [None] * len(self.predicted)
for token in self.predicted:
- values = gold_values[align[token.i].dataXd]
+ values = gold_values[align[token.i]]
values = values.ravel()
if len(values) == 0:
output[token.i] = None
@@ -190,9 +190,9 @@ cdef class Example:
deps = [d if has_deps[i] else deps[i] for i, d in enumerate(proj_deps)]
for cand_i in range(self.x.length):
if cand_to_gold.lengths[cand_i] == 1:
- gold_i = cand_to_gold[cand_i].dataXd[0, 0]
+ gold_i = cand_to_gold[cand_i][0]
if gold_to_cand.lengths[heads[gold_i]] == 1:
- aligned_heads[cand_i] = int(gold_to_cand[heads[gold_i]].dataXd[0, 0])
+ aligned_heads[cand_i] = int(gold_to_cand[heads[gold_i]][0])
aligned_deps[cand_i] = deps[gold_i]
return aligned_heads, aligned_deps
@@ -204,7 +204,7 @@ cdef class Example:
align = self.alignment.y2x
sent_starts = [False] * len(self.x)
for y_sent in self.y.sents:
- x_start = int(align[y_sent.start].dataXd[0])
+ x_start = int(align[y_sent.start][0])
sent_starts[x_start] = True
return sent_starts
else:
@@ -220,7 +220,7 @@ cdef class Example:
seen = set()
output = []
for span in spans:
- indices = align[span.start : span.end].data.ravel()
+ indices = align[span.start : span.end]
if not allow_overlap:
indices = [idx for idx in indices if idx not in seen]
if len(indices) >= 1:
@@ -256,6 +256,29 @@ cdef class Example:
x_ents, x_tags = self.get_aligned_ents_and_ner()
return x_tags
+ def get_matching_ents(self, check_label=True):
+ """Return entities that are shared between predicted and reference docs.
+
+ If `check_label` is True, entities must have matching labels to be
+ kept. Otherwise only the character indices need to match.
+ """
+ gold = {}
+ for ent in self.reference.ents:
+ gold[(ent.start_char, ent.end_char)] = ent.label
+
+ keep = []
+ for ent in self.predicted.ents:
+ key = (ent.start_char, ent.end_char)
+ if key not in gold:
+ continue
+
+ if check_label and ent.label != gold[key]:
+ continue
+
+ keep.append(ent)
+
+ return keep
+
def to_dict(self):
return {
"doc_annotation": {
@@ -293,7 +316,7 @@ cdef class Example:
seen_indices = set()
output = []
for y_sent in self.reference.sents:
- indices = align[y_sent.start : y_sent.end].data.ravel()
+ indices = align[y_sent.start : y_sent.end]
indices = [idx for idx in indices if idx not in seen_indices]
if indices:
x_sent = self.predicted[indices[0] : indices[-1] + 1]
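A small sketch of the new `Example.get_matching_ents` helper: it keeps predicted entities whose character offsets (and, by default, labels) also occur in the reference. The docs and labels below are made up for illustration.

```python
import spacy
from spacy.tokens import Doc, Span
from spacy.training import Example

nlp = spacy.blank("en")
words = ["I", "like", "London", "and", "Berlin", "."]
predicted = Doc(nlp.vocab, words=words)
reference = Doc(nlp.vocab, words=words)
predicted.ents = [Span(predicted, 2, 3, label="GPE"), Span(predicted, 4, 5, label="GPE")]
reference.ents = [Span(reference, 2, 3, label="GPE"), Span(reference, 4, 5, label="LOC")]

example = Example(predicted, reference)
# Offsets and labels must match by default; Berlin's label differs.
assert [e.text for e in example.get_matching_ents()] == ["London"]
# With check_label=False, matching offsets are enough.
assert [e.text for e in example.get_matching_ents(check_label=False)] == ["London", "Berlin"]
```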
diff --git a/spacy/training/initialize.py b/spacy/training/initialize.py
index b59288e38..48ff7b589 100644
--- a/spacy/training/initialize.py
+++ b/spacy/training/initialize.py
@@ -213,6 +213,7 @@ def convert_vectors(
for lex in nlp.vocab:
if lex.rank and lex.rank != OOV_RANK:
nlp.vocab.vectors.add(lex.orth, row=lex.rank) # type: ignore[attr-defined]
+ nlp.vocab.deduplicate_vectors()
else:
if vectors_loc:
logger.info(f"Reading vectors from {vectors_loc}")
@@ -239,6 +240,7 @@ def convert_vectors(
nlp.vocab.vectors = Vectors(
strings=nlp.vocab.strings, data=vectors_data, keys=vector_keys
)
+ nlp.vocab.deduplicate_vectors()
if name is None:
# TODO: Is this correct? Does this matter?
nlp.vocab.vectors.name = f"{nlp.meta['lang']}_{nlp.meta['name']}.vectors"
diff --git a/spacy/util.py b/spacy/util.py
index 14714143c..66e257dd8 100644
--- a/spacy/util.py
+++ b/spacy/util.py
@@ -485,13 +485,16 @@ def load_model_from_path(
config_path = model_path / "config.cfg"
overrides = dict_to_dot(config)
config = load_config(config_path, overrides=overrides)
- nlp = load_model_from_config(config, vocab=vocab, disable=disable, exclude=exclude)
+ nlp = load_model_from_config(
+ config, vocab=vocab, disable=disable, exclude=exclude, meta=meta
+ )
return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)
def load_model_from_config(
config: Union[Dict[str, Any], Config],
*,
+ meta: Dict[str, Any] = SimpleFrozenDict(),
vocab: Union["Vocab", bool] = True,
disable: Iterable[str] = SimpleFrozenList(),
exclude: Iterable[str] = SimpleFrozenList(),
@@ -529,6 +532,7 @@ def load_model_from_config(
exclude=exclude,
auto_fill=auto_fill,
validate=validate,
+ meta=meta,
)
return nlp
@@ -871,7 +875,6 @@ def get_package_path(name: str) -> Path:
name (str): Package name.
RETURNS (Path): Path to installed package.
"""
- name = name.lower() # use lowercase version to be safe
# Here we're importing the module just to find it. This is worryingly
# indirect, but it's otherwise very difficult to find the package.
pkg = importlib.import_module(name)
diff --git a/spacy/vectors.pyx b/spacy/vectors.pyx
index bc4863703..bcba9d03f 100644
--- a/spacy/vectors.pyx
+++ b/spacy/vectors.pyx
@@ -170,6 +170,8 @@ cdef class Vectors:
DOCS: https://spacy.io/api/vectors#n_keys
"""
+ if self.mode == Mode.floret:
+ return -1
return len(self.key2row)
def __reduce__(self):
@@ -563,8 +565,9 @@ cdef class Vectors:
# the source of numpy.save indicates that the file object is closed after use.
# but it seems that somehow this does not happen, as ResourceWarnings are raised here.
# in order to not rely on this, wrap in context manager.
+ ops = get_current_ops()
with path.open("wb") as _file:
- save_array(self.data, _file)
+ save_array(ops.to_numpy(self.data, byte_order="<"), _file)
serializers = {
"strings": lambda p: self.strings.to_disk(p.with_suffix(".json")),
@@ -600,6 +603,7 @@ cdef class Vectors:
ops = get_current_ops()
if path.exists():
self.data = ops.xp.load(str(path))
+ self.to_ops(ops)
def load_settings(path):
if path.exists():
@@ -629,7 +633,8 @@ cdef class Vectors:
if hasattr(self.data, "to_bytes"):
return self.data.to_bytes()
else:
- return srsly.msgpack_dumps(self.data)
+ ops = get_current_ops()
+ return srsly.msgpack_dumps(ops.to_numpy(self.data, byte_order="<"))
serializers = {
"strings": lambda: self.strings.to_bytes(),
@@ -654,6 +659,8 @@ cdef class Vectors:
else:
xp = get_array_module(self.data)
self.data = xp.asarray(srsly.msgpack_loads(b))
+ ops = get_current_ops()
+ self.to_ops(ops)
deserializers = {
"strings": lambda b: self.strings.from_bytes(b),
diff --git a/spacy/vocab.pyi b/spacy/vocab.pyi
index 713e85c01..4cc359c47 100644
--- a/spacy/vocab.pyi
+++ b/spacy/vocab.pyi
@@ -46,6 +46,7 @@ class Vocab:
def reset_vectors(
self, *, width: Optional[int] = ..., shape: Optional[int] = ...
) -> None: ...
+ def deduplicate_vectors(self) -> None: ...
def prune_vectors(self, nr_row: int, batch_size: int = ...) -> Dict[str, float]: ...
def get_vector(
self,
diff --git a/spacy/vocab.pyx b/spacy/vocab.pyx
index badd291ed..428cadd82 100644
--- a/spacy/vocab.pyx
+++ b/spacy/vocab.pyx
@@ -1,6 +1,7 @@
# cython: profile=True
from libc.string cimport memcpy
+import numpy
import srsly
from thinc.api import get_array_module, get_current_ops
import functools
@@ -297,6 +298,33 @@ cdef class Vocab:
width = width if width is not None else self.vectors.shape[1]
self.vectors = Vectors(strings=self.strings, shape=(self.vectors.shape[0], width))
+ def deduplicate_vectors(self):
+ if self.vectors.mode != VectorsMode.default:
+ raise ValueError(Errors.E858.format(
+ mode=self.vectors.mode,
+ alternative=""
+ ))
+ ops = get_current_ops()
+ xp = get_array_module(self.vectors.data)
+ filled = xp.asarray(
+ sorted(list({row for row in self.vectors.key2row.values()}))
+ )
+ # deduplicate data and remap keys
+ data = numpy.unique(ops.to_numpy(self.vectors.data[filled]), axis=0)
+ data = ops.asarray(data)
+ if data.shape == self.vectors.data.shape:
+ # nothing to deduplicate
+ return
+ row_by_bytes = {row.tobytes(): i for i, row in enumerate(data)}
+ key2row = {
+ key: row_by_bytes[self.vectors.data[row].tobytes()]
+ for key, row in self.vectors.key2row.items()
+ }
+ # replace vectors with deduplicated version
+ self.vectors = Vectors(strings=self.strings, data=data, name=self.vectors.name)
+ for key, row in key2row.items():
+ self.vectors.add(key, row=row)
+
def prune_vectors(self, nr_row, batch_size=1024):
"""Reduce the current vector table to `nr_row` unique entries. Words
mapped to the discarded vectors will be remapped to the closest vector
@@ -325,7 +353,10 @@ cdef class Vocab:
DOCS: https://spacy.io/api/vocab#prune_vectors
"""
if self.vectors.mode != VectorsMode.default:
- raise ValueError(Errors.E866)
+ raise ValueError(Errors.E858.format(
+ mode=self.vectors.mode,
+ alternative=""
+ ))
ops = get_current_ops()
xp = get_array_module(self.vectors.data)
# Make sure all vectors are in the vocab
@@ -354,8 +385,9 @@ cdef class Vocab:
def get_vector(self, orth):
"""Retrieve a vector for a word in the vocabulary. Words can be looked
- up by string or int ID. If no vectors data is loaded, ValueError is
- raised.
+ up by string or int ID. If the current vectors do not contain an entry
+ for the word, a 0-vector with the same number of dimensions as the
+ current vectors is returned.
orth (int / unicode): The hash value of a word, or its unicode string.
RETURNS (numpy.ndarray or cupy.ndarray): A word vector. Size
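The remapping trick used in `deduplicate_vectors` above, reduced to a standalone numpy illustration (this is not spaCy code): unique rows are found with `numpy.unique`, and old rows are mapped onto them via the raw bytes of each row.

```python
import numpy

data = numpy.asarray([[1, 1], [2, 2], [1, 1]], dtype="f")
unique = numpy.unique(data, axis=0)  # [[1, 1], [2, 2]]
row_by_bytes = {row.tobytes(): i for i, row in enumerate(unique)}
remap = [row_by_bytes[data[i].tobytes()] for i in range(data.shape[0])]
assert remap == [0, 1, 0]  # rows 0 and 2 now share one deduplicated row
```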
diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md
index 07b76393f..2bddcb28c 100644
--- a/website/docs/api/architectures.md
+++ b/website/docs/api/architectures.md
@@ -104,7 +104,7 @@ consisting of a CNN and a layer-normalized maxout activation function.
> factory = "tagger"
>
> [components.tagger.model]
-> @architectures = "spacy.Tagger.v1"
+> @architectures = "spacy.Tagger.v2"
>
> [components.tagger.model.tok2vec]
> @architectures = "spacy.Tok2VecListener.v1"
@@ -158,8 +158,8 @@ be configured with the `attrs` argument. The suggested attributes are `NORM`,
`PREFIX`, `SUFFIX` and `SHAPE`. This lets the model take into account some
subword information, without constructing a fully character-based
representation. If pretrained vectors are available, they can be included in the
-representation as well, with the vectors table kept static (i.e. it's
-not updated).
+representation as well, with the vectors table kept static (i.e. it's not
+updated).
| Name | Description |
| ------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
@@ -613,14 +613,15 @@ same signature, but the `use_upper` argument was `True` by default.
## Tagging architectures {#tagger source="spacy/ml/models/tagger.py"}
-### spacy.Tagger.v1 {#Tagger}
+### spacy.Tagger.v2 {#Tagger}
> #### Example Config
>
> ```ini
> [model]
-> @architectures = "spacy.Tagger.v1"
+> @architectures = "spacy.Tagger.v2"
> nO = null
+> normalize = false
>
> [model.tok2vec]
> # ...
@@ -634,8 +635,18 @@ the token vectors.
| ----------- | ------------------------------------------------------------------------------------------ |
| `tok2vec` | Subnetwork to map tokens into vector representations. ~~Model[List[Doc], List[Floats2d]]~~ |
| `nO` | The number of tags to output. Inferred from the data if `None`. ~~Optional[int]~~ |
+| `normalize` | Normalize probabilities during inference. Defaults to `False`. ~~bool~~ |
| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ |
+
+
+- The `normalize` argument was added in `spacy.Tagger.v2`. `spacy.Tagger.v1`
+ always normalizes probabilities during inference.
+
+The other arguments are shared between all versions.
+
+
+
## Text classification architectures {#textcat source="spacy/ml/models/textcat.py"}
A text classification architecture needs to take a [`Doc`](/api/doc) as input,
@@ -858,13 +869,13 @@ into the "real world". This requires 3 main components:
- A machine learning [`Model`](https://thinc.ai/docs/api-model) that picks the
most plausible ID from the set of candidates.
-### spacy.EntityLinker.v1 {#EntityLinker}
+### spacy.EntityLinker.v2 {#EntityLinker}
> #### Example Config
>
> ```ini
> [model]
-> @architectures = "spacy.EntityLinker.v1"
+> @architectures = "spacy.EntityLinker.v2"
> nO = null
>
> [model.tok2vec]
diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md
index 89e2e87d9..e801ff0a6 100644
--- a/website/docs/api/cli.md
+++ b/website/docs/api/cli.md
@@ -626,6 +626,235 @@ will not be available.
| overrides | Config parameters to override. Should be options starting with `--` that correspond to the config section and value to override, e.g. `--paths.train ./train.spacy`. ~~Any (option/flag)~~ |
| **PRINTS** | Debugging information. |
+### debug diff-config {#debug-diff tag="command"}
+
+Show a diff of a config file with respect to spaCy's defaults or another config
+file. If additional settings were used in the creation of the config file, then
+you must supply these as extra parameters to the command when comparing to the
+default settings. The generated diff can also be used when posting to the
+discussion forum to provide more information for the maintainers.
+
+```cli
+$ python -m spacy debug diff-config [config_path] [--compare-to] [--optimize] [--gpu] [--pretraining] [--markdown]
+```
+
+> #### Example
+>
+> ```cli
+> $ python -m spacy debug diff-config ./config.cfg
+> ```
+
+
+
+```
+โน Found user-defined language: 'en'
+โน Found user-defined pipelines: ['tok2vec', 'tagger', 'parser',
+'ner']
+[paths]
++ train = "./data/train.spacy"
++ dev = "./data/dev.spacy"
+- train = null
+- dev = null
+vectors = null
+init_tok2vec = null
+
+[system]
+gpu_allocator = null
++ seed = 42
+- seed = 0
+
+[nlp]
+lang = "en"
+pipeline = ["tok2vec","tagger","parser","ner"]
+batch_size = 1000
+disabled = []
+before_creation = null
+after_creation = null
+after_pipeline_creation = null
+tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
+
+[components]
+
+[components.ner]
+factory = "ner"
+incorrect_spans_key = null
+moves = null
+scorer = {"@scorers":"spacy.ner_scorer.v1"}
+update_with_oracle_cut_size = 100
+
+[components.ner.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "ner"
+extra_state_tokens = false
+- hidden_width = 64
++ hidden_width = 36
+maxout_pieces = 2
+use_upper = true
+nO = null
+
+[components.ner.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+upstream = "*"
+
+[components.parser]
+factory = "parser"
+learn_tokens = false
+min_action_freq = 30
+moves = null
+scorer = {"@scorers":"spacy.parser_scorer.v1"}
+update_with_oracle_cut_size = 100
+
+[components.parser.model]
+@architectures = "spacy.TransitionBasedParser.v2"
+state_type = "parser"
+extra_state_tokens = false
+hidden_width = 128
+maxout_pieces = 3
+use_upper = true
+nO = null
+
+[components.parser.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+upstream = "*"
+
+[components.tagger]
+factory = "tagger"
+neg_prefix = "!"
+overwrite = false
+scorer = {"@scorers":"spacy.tagger_scorer.v1"}
+
+[components.tagger.model]
+@architectures = "spacy.Tagger.v1"
+nO = null
+
+[components.tagger.model.tok2vec]
+@architectures = "spacy.Tok2VecListener.v1"
+width = ${components.tok2vec.model.encode.width}
+upstream = "*"
+
+[components.tok2vec]
+factory = "tok2vec"
+
+[components.tok2vec.model]
+@architectures = "spacy.Tok2Vec.v2"
+
+[components.tok2vec.model.embed]
+@architectures = "spacy.MultiHashEmbed.v2"
+width = ${components.tok2vec.model.encode.width}
+attrs = ["NORM","PREFIX","SUFFIX","SHAPE"]
+rows = [5000,2500,2500,2500]
+include_static_vectors = false
+
+[components.tok2vec.model.encode]
+@architectures = "spacy.MaxoutWindowEncoder.v2"
+width = 96
+depth = 4
+window_size = 1
+maxout_pieces = 3
+
+[corpora]
+
+[corpora.dev]
+@readers = "spacy.Corpus.v1"
+path = ${paths.dev}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[corpora.train]
+@readers = "spacy.Corpus.v1"
+path = ${paths.train}
+max_length = 0
+gold_preproc = false
+limit = 0
+augmenter = null
+
+[training]
+dev_corpus = "corpora.dev"
+train_corpus = "corpora.train"
+seed = ${system.seed}
+gpu_allocator = ${system.gpu_allocator}
+dropout = 0.1
+accumulate_gradient = 1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+frozen_components = []
+annotating_components = []
+before_to_disk = null
+
+[training.batcher]
+@batchers = "spacy.batch_by_words.v1"
+discard_oversize = false
+tolerance = 0.2
+get_length = null
+
+[training.batcher.size]
+@schedules = "compounding.v1"
+start = 100
+stop = 1000
+compound = 1.001
+t = 0.0
+
+[training.logger]
+@loggers = "spacy.ConsoleLogger.v1"
+progress_bar = false
+
+[training.optimizer]
+@optimizers = "Adam.v1"
+beta1 = 0.9
+beta2 = 0.999
+L2_is_weight_decay = true
+L2 = 0.01
+grad_clip = 1.0
+use_averages = false
+eps = 0.00000001
+learn_rate = 0.001
+
+[training.score_weights]
+tag_acc = 0.33
+dep_uas = 0.17
+dep_las = 0.17
+dep_las_per_type = null
+sents_p = null
+sents_r = null
+sents_f = 0.0
+ents_f = 0.33
+ents_p = 0.0
+ents_r = 0.0
+ents_per_type = null
+
+[pretraining]
+
+[initialize]
+vectors = ${paths.vectors}
+init_tok2vec = ${paths.init_tok2vec}
+vocab_data = null
+lookups = null
+before_init = null
+after_init = null
+
+[initialize.components]
+
+[initialize.tokenizer]
+```
+
+
+
+| Name | Description |
+| -------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `config_path` | Path to [training config](/api/data-formats#config) file containing all settings and hyperparameters. ~~Union[Path, str] \(positional)~~ |
+| `compare_to` | Path to another config file to diff against, or `None` to compare against default settings. ~~Optional[Union[Path, str]] \(option)~~ |
+| `optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether the config was optimized for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). Only relevant when comparing against a default config. Defaults to `"efficiency"`. ~~str (option)~~ |
+| `gpu`, `-G` | Whether the config was made to run on a GPU. Only relevant when comparing against a default config. ~~bool (flag)~~ |
+| `pretraining`, `-pt` | Include config for pretraining (with [`spacy pretrain`](/api/cli#pretrain)). Only relevant when comparing against a default config. Defaults to `False`. ~~bool (flag)~~ |
+| `markdown`, `-md` | Generate Markdown for Github issues. Defaults to `False`. ~~bool (flag)~~ |
+| **PRINTS** | Diff between the two config files. |
+
### debug profile {#debug-profile tag="command"}
Profile which functions take the most time in a spaCy pipeline. Input should be
diff --git a/website/docs/api/corpus.md b/website/docs/api/corpus.md
index 986c6f458..35afc8fea 100644
--- a/website/docs/api/corpus.md
+++ b/website/docs/api/corpus.md
@@ -79,6 +79,7 @@ train/test skew.
| `max_length` | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. ~~int~~ |
| `limit` | Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. ~~int~~ |
| `augmenter` | Optional data augmentation callback. ~~Callable[[Language, Example], Iterable[Example]]~~ |
+| `shuffle` | Whether to shuffle the examples. Defaults to `False`. ~~bool~~ |
## Corpus.\_\_call\_\_ {#call tag="method"}
diff --git a/website/docs/api/dependencyparser.md b/website/docs/api/dependencyparser.md
index 118cdc611..103e0826e 100644
--- a/website/docs/api/dependencyparser.md
+++ b/website/docs/api/dependencyparser.md
@@ -100,7 +100,7 @@ shortcut for this and instantiate the component using its string name and
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| `moves` | A list of transition names. Inferred from the data if not provided. ~~Optional[List[str]]~~ |
+| `moves` | A list of transition names. Inferred from the data if not provided. ~~Optional[TransitionSystem]~~ |
| _keyword-only_ | |
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
| `learn_tokens` | Whether to learn to merge subtokens that are split relative to the gold standard. Experimental. Defaults to `False`. ~~bool~~ |
diff --git a/website/docs/api/doc.md b/website/docs/api/doc.md
index 9836b8c21..0008cde31 100644
--- a/website/docs/api/doc.md
+++ b/website/docs/api/doc.md
@@ -34,7 +34,7 @@ Construct a `Doc` object. The most common way to get a `Doc` object is via the
| Name | Description |
| ---------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `vocab` | A storage container for lexical types. ~~Vocab~~ |
-| `words` | A list of strings or integer hash values to add to the document as words. ~~Optional[List[Union[str,int]]]~~ |
+| `words` | A list of strings or integer hash values to add to the document as words. ~~Optional[List[Union[str,int]]]~~ |
| `spaces` | A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. ~~Optional[List[bool]]~~ |
| _keyword-only_ | |
| `user_data` | Optional extra data to attach to the Doc. ~~Dict~~ |
@@ -304,7 +304,8 @@ ancestor is found, e.g. if span excludes a necessary ancestor.
## Doc.has_annotation {#has_annotation tag="method"}
-Check whether the doc contains annotation on a token attribute.
+Check whether the doc contains annotation on a
+[`Token` attribute](/api/token#attributes).
@@ -398,12 +399,14 @@ Concatenate multiple `Doc` objects to form a new one. Raises an error if the
> [str(ent) for doc in docs for ent in doc.ents]
> ```
-| Name | Description |
-| ------------------- | ----------------------------------------------------------------------------------------------------------------- |
-| `docs` | A list of `Doc` objects. ~~List[Doc]~~ |
-| `ensure_whitespace` | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. ~~bool~~ |
-| `attrs` | Optional list of attribute ID ints or attribute name strings. ~~Optional[List[Union[str, int]]]~~ |
-| **RETURNS** | The new `Doc` object that is containing the other docs or `None`, if `docs` is empty or `None`. ~~Optional[Doc]~~ |
+| Name | Description |
+| -------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
+| `docs` | A list of `Doc` objects. ~~List[Doc]~~ |
+| `ensure_whitespace` | Insert a space between two adjacent docs whenever the first doc does not end in whitespace. ~~bool~~ |
+| `attrs` | Optional list of attribute ID ints or attribute name strings. ~~Optional[List[Union[str, int]]]~~ |
+| _keyword-only_ | |
+| `exclude` 3.3 | String names of Doc attributes to exclude. Supported: `spans`, `tensor`, `user_data`. ~~Iterable[str]~~ |
+| **RETURNS** | The new `Doc` object containing the other docs, or `None` if `docs` is empty or `None`. ~~Optional[Doc]~~ |
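+
+For instance, the new `exclude` argument can be used to leave out per-doc data
+that is not needed in the merged document. A minimal sketch, assuming a list of
+`Doc` objects created with the same vocab:
+
+```python
+import spacy
+from spacy.tokens import Doc
+
+nlp = spacy.blank("en")
+docs = [nlp("This is a sentence."), nlp("This is another one.")]
+# Merge the docs, leaving out the tensor and user data of each input doc
+merged = Doc.from_docs(docs, exclude=["tensor", "user_data"])
+print(merged.text)
+```
+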
## Doc.to_disk {#to_disk tag="method" new="2"}
@@ -585,7 +588,7 @@ objects or a [`SpanGroup`](/api/spangroup) to a given key.
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> ```
| Name | Description |
@@ -618,7 +621,7 @@ relative clauses.
To customize the noun chunk iterator in a loaded pipeline, modify
[`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
-[syntax iterator](/usage/adding-languages#language-data) has not been
+[syntax iterator](/usage/linguistic-features#language-data) has not been
implemented for the given language, a `NotImplementedError` is raised.
> #### Example
diff --git a/website/docs/api/edittreelemmatizer.md b/website/docs/api/edittreelemmatizer.md
new file mode 100644
index 000000000..99a705f5e
--- /dev/null
+++ b/website/docs/api/edittreelemmatizer.md
@@ -0,0 +1,409 @@
+---
+title: EditTreeLemmatizer
+tag: class
+source: spacy/pipeline/edit_tree_lemmatizer.py
+new: 3.3
+teaser: 'Pipeline component for lemmatization'
+api_base_class: /api/pipe
+api_string_name: trainable_lemmatizer
+api_trainable: true
+---
+
+A trainable component for assigning base forms to tokens. This lemmatizer uses
+**edit trees** to transform tokens into base forms. The lemmatization model
+predicts which edit tree is applicable to a token. The edit tree data structure
+and construction method used by this lemmatizer were proposed in
+[Joint Lemmatization and Morphological Tagging with Lemming](https://aclanthology.org/D15-1272.pdf)
+(Thomas Mรผller et al., 2015).
+
+For a lookup and rule-based lemmatizer, see [`Lemmatizer`](/api/lemmatizer).
+
+## Assigned Attributes {#assigned-attributes}
+
+Predictions are assigned to `Token.lemma`.
+
+| Location | Value |
+| -------------- | ------------------------- |
+| `Token.lemma` | The lemma (hash). ~~int~~ |
+| `Token.lemma_` | The lemma. ~~str~~ |
+
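+As a quick illustration (assuming a pipeline that contains a trained
+`trainable_lemmatizer` component; the package name below is hypothetical), the
+predicted lemmas are available via the usual token attributes:
+
+```python
+import spacy
+
+nlp = spacy.load("xx_pipeline_with_trainable_lemmatizer")  # hypothetical package
+doc = nlp("She was reading the papers.")
+print([(token.text, token.lemma_) for token in doc])
+```
+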
+## Config and implementation {#config}
+
+The default config is defined by the pipeline component factory and describes
+how the component should be configured. You can override its settings via the
+`config` argument on [`nlp.add_pipe`](/api/language#add_pipe) or in your
+[`config.cfg` for training](/usage/training#config). See the
+[model architectures](/api/architectures) documentation for details on the
+architectures and their arguments and hyperparameters.
+
+> #### Example
+>
+> ```python
+> from spacy.pipeline.edit_tree_lemmatizer import DEFAULT_EDIT_TREE_LEMMATIZER_MODEL
+> config = {"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL}
+> nlp.add_pipe("trainable_lemmatizer", config=config, name="lemmatizer")
+> ```
+
+| Setting | Description |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
+| `model` | A model instance that predicts the edit tree probabilities. The output vectors should match the number of edit trees in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). Defaults to [Tagger](/api/architectures#Tagger). ~~Model[List[Doc], List[Floats2d]]~~ |
+| `backoff` | ~~Token~~ attribute to use when no applicable edit tree is found. Defaults to `orth`. ~~str~~ |
+| `min_tree_freq` | Minimum frequency of an edit tree in the training set to be used. Defaults to `3`. ~~int~~ |
+| `overwrite` | Whether existing annotation is overwritten. Defaults to `False`. ~~bool~~ |
+| `top_k` | The number of most probable edit trees to try before resorting to `backoff`. Defaults to `1`. ~~int~~ |
+| `scorer` | The scoring method. Defaults to [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attribute `"lemma"`. ~~Optional[Callable]~~ |
+
+```python
+%%GITHUB_SPACY/spacy/pipeline/edit_tree_lemmatizer.py
+```
+
+## EditTreeLemmatizer.\_\_init\_\_ {#init tag="method"}
+
+> #### Example
+>
+> ```python
+> # Construction via add_pipe with default model
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+>
+> # Construction via create_pipe with custom model
+> config = {"model": {"@architectures": "my_tagger"}}
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", config=config, name="lemmatizer")
+>
+> # Construction from class
+> from spacy.pipeline import EditTreeLemmatizer
+> lemmatizer = EditTreeLemmatizer(nlp.vocab, model)
+> ```
+
+Create a new pipeline instance. In your application, you would normally use a
+shortcut for this and instantiate the component using its string name and
+[`nlp.add_pipe`](/api/language#add_pipe).
+
+| Name | Description |
+| --------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | The shared vocabulary. ~~Vocab~~ |
+| `model` | A model instance that predicts the edit tree probabilities. The output vectors should match the number of edit trees in size, and be normalized as probabilities (all scores between 0 and 1, with the rows summing to `1`). ~~Model[List[Doc], List[Floats2d]]~~ |
+| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
+| _keyword-only_ | |
+| `backoff` | ~~Token~~ attribute to use when no applicable edit tree is found. Defaults to `orth`. ~~str~~ |
+| `min_tree_freq` | Minimum frequency of an edit tree in the training set to be used. Defaults to `3`. ~~int~~ |
+| `overwrite` | Whether existing annotation is overwritten. Defaults to `False`. ~~bool~~ |
+| `top_k` | The number of most probable edit trees to try before resorting to `backoff`. Defaults to `1`. ~~int~~ |
+| `scorer` | The scoring method. Defaults to [`Scorer.score_token_attr`](/api/scorer#score_token_attr) for the attribute `"lemma"`. ~~Optional[Callable]~~ |
+
+## EditTreeLemmatizer.\_\_call\_\_ {#call tag="method"}
+
+Apply the pipe to one document. The document is modified in place, and returned.
+This usually happens under the hood when the `nlp` object is called on a text
+and all pipeline components are applied to the `Doc` in order. Both
+[`__call__`](/api/edittreelemmatizer#call) and
+[`pipe`](/api/edittreelemmatizer#pipe) delegate to the
+[`predict`](/api/edittreelemmatizer#predict) and
+[`set_annotations`](/api/edittreelemmatizer#set_annotations) methods.
+
+> #### Example
+>
+> ```python
+> doc = nlp("This is a sentence.")
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> # This usually happens under the hood
+> processed = lemmatizer(doc)
+> ```
+
+| Name | Description |
+| ----------- | -------------------------------- |
+| `doc` | The document to process. ~~Doc~~ |
+| **RETURNS** | The processed document. ~~Doc~~ |
+
+## EditTreeLemmatizer.pipe {#pipe tag="method"}
+
+Apply the pipe to a stream of documents. This usually happens under the hood
+when the `nlp` object is called on a text and all pipeline components are
+applied to the `Doc` in order. Both [`__call__`](/api/edittreelemmatizer#call)
+and [`pipe`](/api/edittreelemmatizer#pipe) delegate to the
+[`predict`](/api/edittreelemmatizer#predict) and
+[`set_annotations`](/api/edittreelemmatizer#set_annotations) methods.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> for doc in lemmatizer.pipe(docs, batch_size=50):
+> pass
+> ```
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------- |
+| `stream` | A stream of documents. ~~Iterable[Doc]~~ |
+| _keyword-only_ | |
+| `batch_size` | The number of documents to buffer. Defaults to `128`. ~~int~~ |
+| **YIELDS** | The processed documents in order. ~~Doc~~ |
+
+## EditTreeLemmatizer.initialize {#initialize tag="method" new="3"}
+
+Initialize the component for training. `get_examples` should be a function that
+returns an iterable of [`Example`](/api/example) objects. The data examples are
+used to **initialize the model** of the component and can either be the full
+training data or a representative sample. Initialization includes validating the
+network,
+[inferring missing shapes](https://thinc.ai/docs/usage-models#validation) and
+setting up the label scheme based on the data. This method is typically called
+by [`Language.initialize`](/api/language#initialize) and lets you customize
+arguments it receives via the
+[`[initialize.components]`](/api/data-formats#config-initialize) block in the
+config.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> lemmatizer.initialize(lambda: [], nlp=nlp)
+> ```
+>
+> ```ini
+> ### config.cfg
+> [initialize.components.lemmatizer]
+>
+> [initialize.components.lemmatizer.labels]
+> @readers = "spacy.read_labels.v1"
+> path = "corpus/labels/lemmatizer.json"
+> ```
+
+| Name | Description |
+| -------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `get_examples` | Function that returns gold-standard annotations in the form of [`Example`](/api/example) objects. ~~Callable[[], Iterable[Example]]~~ |
+| _keyword-only_ | |
+| `nlp` | The current `nlp` object. Defaults to `None`. ~~Optional[Language]~~ |
+| `labels` | The label information to add to the component, as provided by the [`label_data`](#label_data) property after initialization. To generate a reusable JSON file from your data, you should run the [`init labels`](/api/cli#init-labels) command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. ~~Optional[Iterable[str]]~~ |
+
+## EditTreeLemmatizer.predict {#predict tag="method"}
+
+Apply the component's model to a batch of [`Doc`](/api/doc) objects, without
+modifying them.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> tree_ids = lemmatizer.predict([doc1, doc2])
+> ```
+
+| Name | Description |
+| ----------- | ------------------------------------------- |
+| `docs` | The documents to predict. ~~Iterable[Doc]~~ |
+| **RETURNS** | The model's prediction for each document. |
+
+## EditTreeLemmatizer.set_annotations {#set_annotations tag="method"}
+
+Modify a batch of [`Doc`](/api/doc) objects, using pre-computed tree
+identifiers.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> tree_ids = lemmatizer.predict([doc1, doc2])
+> lemmatizer.set_annotations([doc1, doc2], tree_ids)
+> ```
+
+| Name | Description |
+| ---------- | ------------------------------------------------------------------------------------- |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
+| `tree_ids` | The identifiers of the edit trees to apply, produced by `EditTreeLemmatizer.predict`. |
+
+## EditTreeLemmatizer.update {#update tag="method"}
+
+Learn from a batch of [`Example`](/api/example) objects containing the
+predictions and gold-standard annotations, and update the component's model.
+Delegates to [`predict`](/api/edittreelemmatizer#predict) and
+[`get_loss`](/api/edittreelemmatizer#get_loss).
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> optimizer = nlp.initialize()
+> losses = lemmatizer.update(examples, sgd=optimizer)
+> ```
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| `examples` | A batch of [`Example`](/api/example) objects to learn from. ~~Iterable[Example]~~ |
+| _keyword-only_ | |
+| `drop` | The dropout rate. ~~float~~ |
+| `sgd` | An optimizer. Will be created via [`create_optimizer`](#create_optimizer) if not set. ~~Optional[Optimizer]~~ |
+| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
+| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
+
+## EditTreeLemmatizer.get_loss {#get_loss tag="method"}
+
+Find the loss and gradient of loss for the batch of documents and their
+predicted scores.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> scores = lemmatizer.model.begin_update([eg.predicted for eg in examples])
+> loss, d_loss = lemmatizer.get_loss(examples, scores)
+> ```
+
+| Name | Description |
+| ----------- | --------------------------------------------------------------------------- |
+| `examples` | The batch of examples. ~~Iterable[Example]~~ |
+| `scores` | Scores representing the model's predictions. |
+| **RETURNS** | The loss and the gradient, i.e. `(loss, gradient)`. ~~Tuple[float, float]~~ |
+
+## EditTreeLemmatizer.create_optimizer {#create_optimizer tag="method"}
+
+Create an optimizer for the pipeline component.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> optimizer = lemmatizer.create_optimizer()
+> ```
+
+| Name | Description |
+| ----------- | ---------------------------- |
+| **RETURNS** | The optimizer. ~~Optimizer~~ |
+
+## EditTreeLemmatizer.use_params {#use_params tag="method, contextmanager"}
+
+Modify the pipe's model, to use the given parameter values. At the end of the
+context, the original parameters are restored.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> with lemmatizer.use_params(optimizer.averages):
+> lemmatizer.to_disk("/best_model")
+> ```
+
+| Name | Description |
+| -------- | -------------------------------------------------- |
+| `params` | The parameter values to use in the model. ~~dict~~ |
+
+## EditTreeLemmatizer.to_disk {#to_disk tag="method"}
+
+Serialize the pipe to disk.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> lemmatizer.to_disk("/path/to/lemmatizer")
+> ```
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
+| `path` | A path to a directory, which will be created if it doesn't exist. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+
+## EditTreeLemmatizer.from_disk {#from_disk tag="method"}
+
+Load the pipe from disk. Modifies the object in place and returns it.
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> lemmatizer.from_disk("/path/to/lemmatizer")
+> ```
+
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------- |
+| `path` | A path to a directory. Paths may be either strings or `Path`-like objects. ~~Union[str, Path]~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The modified `EditTreeLemmatizer` object. ~~EditTreeLemmatizer~~ |
+
+## EditTreeLemmatizer.to_bytes {#to_bytes tag="method"}
+
+> #### Example
+>
+> ```python
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> lemmatizer_bytes = lemmatizer.to_bytes()
+> ```
+
+Serialize the pipe to a bytestring.
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The serialized form of the `EditTreeLemmatizer` object. ~~bytes~~ |
+
+## EditTreeLemmatizer.from_bytes {#from_bytes tag="method"}
+
+Load the pipe from a bytestring. Modifies the object in place and returns it.
+
+> #### Example
+>
+> ```python
+> lemmatizer_bytes = lemmatizer.to_bytes()
+> lemmatizer = nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+> lemmatizer.from_bytes(lemmatizer_bytes)
+> ```
+
+| Name | Description |
+| -------------- | ------------------------------------------------------------------------------------------- |
+| `bytes_data` | The data to load from. ~~bytes~~ |
+| _keyword-only_ | |
+| `exclude` | String names of [serialization fields](#serialization-fields) to exclude. ~~Iterable[str]~~ |
+| **RETURNS** | The `EditTreeLemmatizer` object. ~~EditTreeLemmatizer~~ |
+
+## EditTreeLemmatizer.labels {#labels tag="property"}
+
+The labels currently added to the component.
+
+
+
+The `EditTreeLemmatizer` labels are not useful by themselves, since they are
+identifiers of edit trees.
+
+
+
+| Name | Description |
+| ----------- | ------------------------------------------------------ |
+| **RETURNS** | The labels added to the component. ~~Tuple[str, ...]~~ |
+
+## EditTreeLemmatizer.label_data {#label_data tag="property" new="3"}
+
+The labels currently added to the component and their internal meta information.
+This is the data generated by [`init labels`](/api/cli#init-labels) and used by
+[`EditTreeLemmatizer.initialize`](/api/edittreelemmatizer#initialize) to
+initialize the model with a pre-defined label set.
+
+> #### Example
+>
+> ```python
+> labels = lemmatizer.label_data
+> lemmatizer.initialize(lambda: [], nlp=nlp, labels=labels)
+> ```
+
+| Name | Description |
+| ----------- | ---------------------------------------------------------- |
+| **RETURNS** | The label data added to the component. ~~Tuple[str, ...]~~ |
+
+## Serialization fields {#serialization-fields}
+
+During serialization, spaCy will export several data fields used to restore
+different aspects of the object. If needed, you can exclude them from
+serialization by passing in the string names via the `exclude` argument.
+
+> #### Example
+>
+> ```python
+> data = lemmatizer.to_disk("/path", exclude=["vocab"])
+> ```
+
+| Name | Description |
+| ------- | -------------------------------------------------------------- |
+| `vocab` | The shared [`Vocab`](/api/vocab). |
+| `cfg` | The config file. You usually don't want to exclude this. |
+| `model` | The binary model data. You usually don't want to exclude this. |
+| `trees` | The edit trees. You usually don't want to exclude this. |
diff --git a/website/docs/api/entitylinker.md b/website/docs/api/entitylinker.md
index 3d3372679..8e0d6087a 100644
--- a/website/docs/api/entitylinker.md
+++ b/website/docs/api/entitylinker.md
@@ -59,6 +59,7 @@ architectures and their arguments and hyperparameters.
| `incl_context` | Whether or not to include the local context in the model. Defaults to `True`. ~~bool~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [EntityLinker](/api/architectures#EntityLinker). ~~Model~~ |
| `entity_vector_length` | Size of encoding vectors in the KB. Defaults to `64`. ~~int~~ |
+| `use_gold_ents` | Whether to copy entities from the gold docs or not. Defaults to `True`. If `False`, entities must be set in the training data or by an annotating component in the pipeline (see the example below). ~~bool~~ |
| `get_candidates` | Function that generates plausible candidates for a given `Span` object. Defaults to [CandidateGenerator](/api/architectures#CandidateGenerator), a function looking up exact, case-dependent aliases in the KB. ~~Callable[[KnowledgeBase, Span], Iterable[Candidate]]~~ |
| `overwrite` 3.2 | Whether existing annotation is overwritten. Defaults to `True`. ~~bool~~ |
| `scorer` 3.2 | The scoring method. Defaults to [`Scorer.score_links`](/api/scorer#score_links). ~~Optional[Callable]~~ |
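+
+For example, a rough sketch of overriding the new `use_gold_ents` setting when
+adding the component, so that entities have to come from the training data or
+an annotating component rather than from the gold docs:
+
+```python
+import spacy
+
+nlp = spacy.blank("en")
+# Do not copy entities from the gold docs during training
+nlp.add_pipe("entity_linker", config={"use_gold_ents": False})
+```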
diff --git a/website/docs/api/entityrecognizer.md b/website/docs/api/entityrecognizer.md
index 14b6fece4..7c153f064 100644
--- a/website/docs/api/entityrecognizer.md
+++ b/website/docs/api/entityrecognizer.md
@@ -62,7 +62,7 @@ architectures and their arguments and hyperparameters.
| Setting | Description |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `moves` | A list of transition names. Inferred from the data if not provided. Defaults to `None`. ~~Optional[List[str]]~~ |
+| `moves` | A list of transition names. Inferred from the data if not provided. Defaults to `None`. ~~Optional[TransitionSystem]~~ |
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. Defaults to [TransitionBasedParser](/api/architectures#TransitionBasedParser). ~~Model[List[Doc], List[Floats2d]]~~ |
| `incorrect_spans_key` | This key refers to a `SpanGroup` in `doc.spans` that specifies incorrect spans. The NER will learn not to predict (exactly) those spans. Defaults to `None`. ~~Optional[str]~~ |
@@ -98,7 +98,7 @@ shortcut for this and instantiate the component using its string name and
| `vocab` | The shared vocabulary. ~~Vocab~~ |
| `model` | The [`Model`](https://thinc.ai/docs/api-model) powering the pipeline component. ~~Model[List[Doc], List[Floats2d]]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
-| `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[List[str]]~~ |
+| `moves` | A list of transition names. Inferred from the data if set to `None`, which is the default. ~~Optional[TransitionSystem]~~ |
| _keyword-only_ | |
| `update_with_oracle_cut_size` | During training, cut long sequences into shorter segments by creating intermediate states based on the gold-standard history. The model is not very sensitive to this parameter, so you usually won't need to change it. Defaults to `100`. ~~int~~ |
| `incorrect_spans_key` | Identifies spans that are known to be incorrect entity annotations. The incorrect entity annotations can be stored in the span group in [`Doc.spans`](/api/doc#spans), under this key. Defaults to `None`. ~~Optional[str]~~ |
diff --git a/website/docs/api/legacy.md b/website/docs/api/legacy.md
index 916a5bf7f..e24c37d77 100644
--- a/website/docs/api/legacy.md
+++ b/website/docs/api/legacy.md
@@ -248,23 +248,6 @@ the others, but may not be as accurate, especially if texts are short.
## Loggers {#loggers}
-These functions are available from `@spacy.registry.loggers`.
+Logging utilities for spaCy are implemented in the [`spacy-loggers`](https://github.com/explosion/spacy-loggers) repo, and the functions are typically available from `@spacy.registry.loggers`.
-### spacy.WandbLogger.v1 {#WandbLogger_v1}
-
-The first version of the [`WandbLogger`](/api/top-level#WandbLogger) did not yet
-support the `log_dataset_dir` and `model_log_interval` arguments.
-
-> #### Example config
->
-> ```ini
-> [training.logger]
-> @loggers = "spacy.WandbLogger.v1"
-> project_name = "monitor_spacy_training"
-> remove_config_values = ["paths.train", "paths.dev", "corpora.train.path", "corpora.dev.path"]
-> ```
->
-> | Name | Description |
-> | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
-> | `project_name` | The name of the project in the Weights & Biases interface. The project will be created automatically if it doesn't exist yet. ~~str~~ |
-> | `remove_config_values` | A list of values to include from the config before it is uploaded to W&B (default: empty). ~~List[str]~~ |
+More documentation can be found in that repo's [readme](https://github.com/explosion/spacy-loggers/blob/main/README.md) file.
diff --git a/website/docs/api/lemmatizer.md b/website/docs/api/lemmatizer.md
index 2fa040917..75387305a 100644
--- a/website/docs/api/lemmatizer.md
+++ b/website/docs/api/lemmatizer.md
@@ -9,14 +9,15 @@ api_trainable: false
---
Component for assigning base forms to tokens using rules based on part-of-speech
-tags, or lookup tables. Functionality to train the component is coming soon.
-Different [`Language`](/api/language) subclasses can implement their own
-lemmatizer components via
+tags, or lookup tables. Different [`Language`](/api/language) subclasses can
+implement their own lemmatizer components via
[language-specific factories](/usage/processing-pipelines#factories-language).
The default data used is provided by the
[`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data)
extension package.
+For a trainable lemmatizer, see [`EditTreeLemmatizer`](/api/edittreelemmatizer).
+
As of v3.0, the `Lemmatizer` is a **standalone pipeline component** that can be
diff --git a/website/docs/api/matcher.md b/website/docs/api/matcher.md
index 3e7f9dc04..273c202ca 100644
--- a/website/docs/api/matcher.md
+++ b/website/docs/api/matcher.md
@@ -34,6 +34,7 @@ rule-based matching are:
| ----------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
+| `NORM` | The normalized form of the token text. ~~str~~ |
| `LOWER` | The lowercase form of the token text. ~~str~~ |
| ย `LENGTH` | The length of the token text. ~~int~~ |
| ย `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT` | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~ |
diff --git a/website/docs/api/span.md b/website/docs/api/span.md
index ff7905bc0..d765a199c 100644
--- a/website/docs/api/span.md
+++ b/website/docs/api/span.md
@@ -283,8 +283,9 @@ objects, if the document has been syntactically parsed. A base noun phrase, or
it โ so no NP-level coordination, no prepositional phrases, and no relative
clauses.
-If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has
-not been implemeted for the given language, a `NotImplementedError` is raised.
+If the `noun_chunk` [syntax iterator](/usage/linguistic-features#language-data)
+has not been implemented for the given language, a `NotImplementedError` is
+raised.
> #### Example
>
@@ -520,12 +521,13 @@ sent = doc[sent.start : max(sent.end, span.end)]
## Span.sents {#sents tag="property" model="sentences" new="3.2.1"}
-Returns a generator over the sentences the span belongs to. This property is only available
-when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
-document by the `parser`, `senter`, `sentencizer` or some custom function. It
-will raise an error otherwise.
+Returns a generator over the sentences the span belongs to. This property is
+only available when [sentence boundaries](/usage/linguistic-features#sbd) have
+been set on the document by the `parser`, `senter`, `sentencizer` or some custom
+function. It will raise an error otherwise.
-If the span happens to cross sentence boundaries, all sentences the span overlaps with will be returned.
+If the span happens to cross sentence boundaries, all sentences the span
+overlaps with will be returned.
> #### Example
>
diff --git a/website/docs/api/spancategorizer.md b/website/docs/api/spancategorizer.md
index 26fcaefdf..f09ac8bdb 100644
--- a/website/docs/api/spancategorizer.md
+++ b/website/docs/api/spancategorizer.md
@@ -56,7 +56,7 @@ architectures and their arguments and hyperparameters.
| -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `suggester` | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. Defaults to [`ngram_suggester`](#ngram_suggester). ~~Callable[[Iterable[Doc], Optional[Ops]], Ragged]~~ |
| `model` | A model instance that is given a a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. Defaults to [SpanCategorizer](/api/architectures#SpanCategorizer). ~~Model[Tuple[List[Doc], Ragged], Floats2d]~~ |
-| `spans_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"spans"`. ~~str~~ |
+| `spans_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. ~~str~~ |
| `threshold` | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~ |
| `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~ |
| `scorer` | The scoring method. Defaults to [`Scorer.score_spans`](/api/scorer#score_spans) for `Doc.spans[spans_key]` with overlapping spans allowed. ~~Optional[Callable]~~ |
@@ -93,7 +93,7 @@ shortcut for this and instantiate the component using its string name and
| `suggester` | A function that [suggests spans](#suggesters). Spans are returned as a ragged array with two integer columns, for the start and end positions. ~~Callable[[Iterable[Doc], Optional[Ops]], Ragged]~~ |
| `name` | String name of the component instance. Used to add entries to the `losses` during training. ~~str~~ |
| _keyword-only_ | |
-| `spans_key` | Key of the [`Doc.spans`](/api/doc#sans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"spans"`. ~~str~~ |
+| `spans_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. ~~str~~ |
| `threshold` | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. ~~float~~ |
| `max_positive` | Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. ~~Optional[int]~~ |
@@ -239,6 +239,24 @@ Delegates to [`predict`](/api/spancategorizer#predict) and
| `losses` | Optional record of the loss during training. Updated using the component name as the key. ~~Optional[Dict[str, float]]~~ |
| **RETURNS** | The updated `losses` dictionary. ~~Dict[str, float]~~ |
+## SpanCategorizer.set_candidates {#set_candidates tag="method", new="3.3"}
+
+Use the suggester to add a list of [`Span`](/api/span) candidates to a list of
+[`Doc`](/api/doc) objects. This method is intended to be used for debugging
+purposes.
+
+> #### Example
+>
+> ```python
+> spancat = nlp.add_pipe("spancat")
+> spancat.set_candidates(docs, "candidates")
+> ```
+
+| Name | Description |
+| ---------------- | -------------------------------------------------------------------- |
+| `docs` | The documents to modify. ~~Iterable[Doc]~~ |
+| `candidates_key` | Key of the [`Doc.spans`](/api/doc#spans) dict to save the candidate spans under. ~~str~~ |
+
## SpanCategorizer.get_loss {#get_loss tag="method"}
Find the loss and gradient of loss for the batch of documents and their
diff --git a/website/docs/api/spangroup.md b/website/docs/api/spangroup.md
index 654067eb1..1e2d18a82 100644
--- a/website/docs/api/spangroup.md
+++ b/website/docs/api/spangroup.md
@@ -21,7 +21,7 @@ Create a `SpanGroup`.
>
> ```python
> doc = nlp("Their goi ng home")
-> spans = [doc[0:1], doc[2:4]]
+> spans = [doc[0:1], doc[1:3]]
>
> # Construction 1
> from spacy.tokens import SpanGroup
@@ -60,7 +60,7 @@ the scope of your function.
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert doc.spans["errors"].doc == doc
> ```
@@ -76,9 +76,9 @@ Check whether the span group contains overlapping spans.
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert not doc.spans["errors"].has_overlap
-> doc.spans["errors"].append(doc[1:2])
+> doc.spans["errors"].append(doc[2:4])
> assert doc.spans["errors"].has_overlap
> ```
@@ -94,7 +94,7 @@ Get the number of spans in the group.
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> assert len(doc.spans["errors"]) == 2
> ```
@@ -104,15 +104,20 @@ Get the number of spans in the group.
## SpanGroup.\_\_getitem\_\_ {#getitem tag="method"}
-Get a span from the group.
+Get a span from the group. Note that a copy of the span is returned, so if any
+changes are made to this span, they are not reflected in the corresponding
+member of the span group. The item or group will need to be reassigned for
+changes to be reflected in the span group.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> span = doc.spans["errors"][1]
> assert span.text == "goi ng"
+> span.label_ = 'LABEL'
+> assert doc.spans["errors"][1].label_ != 'LABEL' # The span within the group was not updated
> ```
| Name | Description |
@@ -120,6 +125,83 @@ Get a span from the group.
| `i` | The item index. ~~int~~ |
| **RETURNS** | The span at the given index. ~~Span~~ |
+## SpanGroup.\_\_setitem\_\_ {#setitem tag="method", new="3.3"}
+
+Set a span in the span group.
+
+> #### Example
+>
+> ```python
+> doc = nlp("Their goi ng home")
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
+> span = doc[0:2]
+> doc.spans["errors"][0] = span
+> assert doc.spans["errors"][0].text == "Their goi"
+> ```
+
+| Name | Description |
+| ------ | ----------------------- |
+| `i` | The item index. ~~int~~ |
+| `span` | The new value. ~~Span~~ |
+
+## SpanGroup.\_\_delitem\_\_ {#delitem tag="method", new="3.3"}
+
+Delete a span from the span group.
+
+> #### Example
+>
+> ```python
+> doc = nlp("Their goi ng home")
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
+> del doc.spans["errors"][0]
+> assert len(doc.spans["errors"]) == 1
+> ```
+
+| Name | Description |
+| ---- | ----------------------- |
+| `i` | The item index. ~~int~~ |
+
+## SpanGroup.\_\_add\_\_ {#add tag="method", new="3.3"}
+
+Concatenate the current span group with another span group and return the result
+in a new span group. Any `attrs` from the first span group will have precedence
+over `attrs` in the second.
+
+> #### Example
+>
+> ```python
+> doc = nlp("Their goi ng home")
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
+> doc.spans["other"] = [doc[0:2], doc[2:4]]
+> span_group = doc.spans["errors"] + doc.spans["other"]
+> assert len(span_group) == 4
+> ```
+
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------- |
+| `other` | The span group or spans to concatenate. ~~Union[SpanGroup, Iterable[Span]]~~ |
+| **RETURNS** | The new span group. ~~SpanGroup~~ |
+
+## SpanGroup.\_\_iadd\_\_ {#iadd tag="method", new="3.3"}
+
+Append an iterable of spans or the content of a span group to the current span
+group. Any `attrs` in the other span group will be added for keys that are not
+already present in the current span group.
+
+> #### Example
+>
+> ```python
+> doc = nlp("Their goi ng home")
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
+> doc.spans["errors"] += [doc[3:4], doc[2:3]]
+> assert len(doc.spans["errors"]) == 4
+> ```
+
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------- |
+| `other` | The span group or spans to append. ~~Union[SpanGroup, Iterable[Span]]~~ |
+| **RETURNS** | The span group. ~~SpanGroup~~ |
+
## SpanGroup.append {#append tag="method"}
Add a [`Span`](/api/span) object to the group. The span must refer to the same
@@ -130,7 +212,7 @@ Add a [`Span`](/api/span) object to the group. The span must refer to the same
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = [doc[0:1]]
-> doc.spans["errors"].append(doc[2:4])
+> doc.spans["errors"].append(doc[1:3])
> assert len(doc.spans["errors"]) == 2
> ```
@@ -140,21 +222,42 @@ Add a [`Span`](/api/span) object to the group. The span must refer to the same
## SpanGroup.extend {#extend tag="method"}
-Add multiple [`Span`](/api/span) objects to the group. All spans must refer to
-the same [`Doc`](/api/doc) object as the span group.
+Add multiple [`Span`](/api/span) objects or contents of another `SpanGroup` to
+the group. All spans must refer to the same [`Doc`](/api/doc) object as the span
+group.
> #### Example
>
> ```python
> doc = nlp("Their goi ng home")
> doc.spans["errors"] = []
-> doc.spans["errors"].extend([doc[2:4], doc[0:1]])
+> doc.spans["errors"].extend([doc[1:3], doc[0:1]])
> assert len(doc.spans["errors"]) == 2
+> span_group = SpanGroup(doc, spans=[doc[1:4], doc[0:3]])
+> doc.spans["errors"].extend(span_group)
> ```
-| Name | Description |
-| ------- | ------------------------------------ |
-| `spans` | The spans to add. ~~Iterable[Span]~~ |
+| Name | Description |
+| ------- | -------------------------------------------------------- |
+| `spans` | The spans to add. ~~Union[SpanGroup, Iterable[Span]]~~ |
+
+## SpanGroup.copy {#copy tag="method", new="3.3"}
+
+Return a copy of the span group.
+
+> #### Example
+>
+> ```python
+> from spacy.tokens import SpanGroup
+>
+> doc = nlp("Their goi ng home")
+> doc.spans["errors"] = [doc[1:3], doc[0:3]]
+> new_group = doc.spans["errors"].copy()
+> ```
+
+| Name | Description |
+| ----------- | ----------------------------------------------- |
+| **RETURNS** | A copy of the `SpanGroup` object. ~~SpanGroup~~ |
## SpanGroup.to_bytes {#to_bytes tag="method"}
@@ -164,7 +267,7 @@ Serialize the span group to a bytestring.
>
> ```python
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> group_bytes = doc.spans["errors"].to_bytes()
> ```
@@ -183,7 +286,7 @@ it.
> from spacy.tokens import SpanGroup
>
> doc = nlp("Their goi ng home")
-> doc.spans["errors"] = [doc[0:1], doc[2:4]]
+> doc.spans["errors"] = [doc[0:1], doc[1:3]]
> group_bytes = doc.spans["errors"].to_bytes()
> new_group = SpanGroup()
> new_group.from_bytes(group_bytes)
diff --git a/website/docs/api/token.md b/website/docs/api/token.md
index 44a2ea9e8..3c3d12d54 100644
--- a/website/docs/api/token.md
+++ b/website/docs/api/token.md
@@ -349,23 +349,6 @@ A sequence containing the token and all the token's syntactic descendants.
| ---------- | ------------------------------------------------------------------------------------ |
| **YIELDS** | A descendant token such that `self.is_ancestor(token)` or `token == self`. ~~Token~~ |
-## Token.is_sent_start {#is_sent_start tag="property" new="2"}
-
-A boolean value indicating whether the token starts a sentence. `None` if
-unknown. Defaults to `True` for the first token in the `Doc`.
-
-> #### Example
->
-> ```python
-> doc = nlp("Give it back! He pleaded.")
-> assert doc[4].is_sent_start
-> assert not doc[5].is_sent_start
-> ```
-
-| Name | Description |
-| ----------- | ------------------------------------------------------- |
-| **RETURNS** | Whether the token starts a sentence. ~~Optional[bool]~~ |
-
## Token.has_vector {#has_vector tag="property" model="vectors"}
A boolean value indicating whether a word vector is associated with the token.
@@ -465,6 +448,8 @@ The L2 norm of the token's vector representation.
| `is_punct` | Is the token punctuation? ~~bool~~ |
| `is_left_punct` | Is the token a left punctuation mark, e.g. `"("` ? ~~bool~~ |
| `is_right_punct` | Is the token a right punctuation mark, e.g. `")"` ? ~~bool~~ |
+| `is_sent_start` | Does the token start a sentence? `None` if unknown. Defaults to `True` for the first token in the `Doc`. ~~Optional[bool]~~ |
+| `is_sent_end` | Does the token end a sentence? `None` if unknown. ~~Optional[bool]~~ |
| `is_space` | Does the token consist of whitespace characters? Equivalent to `token.text.isspace()`. ~~bool~~ |
| `is_bracket` | Is the token a bracket? ~~bool~~ |
| `is_quote` | Is the token a quotation mark? ~~bool~~ |
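+
+For example, assuming sentence boundaries have been set on the document, e.g.
+by the `parser` of a trained pipeline:
+
+```python
+import spacy
+
+nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries
+doc = nlp("Give it back! He pleaded.")
+assert doc[4].is_sent_start  # "He" starts the second sentence
+assert doc[3].is_sent_end    # "!" ends the first sentence
+```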
diff --git a/website/docs/api/tokenizer.md b/website/docs/api/tokenizer.md
index 8809c10bc..6eb7e8024 100644
--- a/website/docs/api/tokenizer.md
+++ b/website/docs/api/tokenizer.md
@@ -44,15 +44,16 @@ how to construct a custom tokenizer with different tokenization rules, see the
> tokenizer = nlp.tokenizer
> ```
-| Name | Description |
-| ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `vocab` | A storage container for lexical types. ~~Vocab~~ |
-| `rules` | Exceptions and special-cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
-| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
-| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
-| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
-| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
-| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
+| Name | Description |
+| -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `vocab` | A storage container for lexical types. ~~Vocab~~ |
+| `rules` | Exceptions and special-cases for the tokenizer. ~~Optional[Dict[str, List[Dict[int, str]]]]~~ |
+| `prefix_search` | A function matching the signature of `re.compile(string).search` to match prefixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
+| `suffix_search` | A function matching the signature of `re.compile(string).search` to match suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
+| `infix_finditer` | A function matching the signature of `re.compile(string).finditer` to find infixes. ~~Optional[Callable[[str], Iterator[Match]]]~~ |
+| `token_match` | A function matching the signature of `re.compile(string).match` to find token matches. ~~Optional[Callable[[str], Optional[Match]]]~~ |
+| `url_match` | A function matching the signature of `re.compile(string).match` to find token matches after considering prefixes and suffixes. ~~Optional[Callable[[str], Optional[Match]]]~~ |
+| `faster_heuristics` 3.3.0 | Whether to restrict the final `Matcher`-based pass for rules to those containing affixes or space. Defaults to `True`. ~~bool~~ |
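+
+As a rough sketch, the new `faster_heuristics` flag can be passed when
+constructing a tokenizer directly (here with otherwise default settings):
+
+```python
+import spacy
+from spacy.tokenizer import Tokenizer
+
+nlp = spacy.blank("en")
+# Disable the restriction of the final Matcher-based pass for special-case rules
+tokenizer = Tokenizer(nlp.vocab, faster_heuristics=False)
+```
+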
## Tokenizer.\_\_call\_\_ {#call tag="method"}
diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md
index be19f9c3a..f2fd1415f 100644
--- a/website/docs/api/top-level.md
+++ b/website/docs/api/top-level.md
@@ -263,7 +263,7 @@ Render a dependency parse tree or named entity visualization.
| Name | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span]], Doc, Span]~~ |
+| `docs` | Document(s) or span(s) to visualize. ~~Union[Iterable[Union[Doc, Span, dict]], Doc, Span, dict]~~ |
| `style` | Visualization style, `"dep"` or `"ent"`. Defaults to `"dep"`. ~~str~~ |
| `page` | Render markup as full HTML page. Defaults to `True`. ~~bool~~ |
| `minify` | Minify HTML markup. Defaults to `False`. ~~bool~~ |
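+
+Since `docs` now also accepts dicts, a manually prepared visualization can be
+rendered directly. A minimal sketch for the entity style, using the existing
+`manual=True` flag to tell displaCy to skip `Doc` processing:
+
+```python
+from spacy import displacy
+
+ex = {
+    "text": "But Google is starting from behind.",
+    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
+    "title": None,
+}
+html = displacy.render(ex, style="ent", manual=True)
+```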
@@ -320,12 +320,31 @@ If a setting is not present in the options, the default value will be used.
| `template` 2.2 | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](%%GITHUB_SPACY/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ |
| `kb_url_template` 3.2.1 | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |
-By default, displaCy comes with colors for all entity types used by
-[spaCy's trained pipelines](/models). If you're using custom entity types, you
-can use the `colors` setting to add your own colors for them. Your application
-or pipeline package can also expose a
-[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
-to add custom labels and their colors automatically.
+
+#### Span Visualizer options {#displacy_options-span}
+
+> #### Example
+>
+> ```python
+> options = {"spans_key": "sc"}
+> displacy.serve(doc, style="span", options=options)
+> ```
+
+| Name | Description |
+| ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `spans_key` | Which spans key to render spans from. Default is `"sc"`. ~~str~~ |
+| `templates` | Dictionary containing the keys `"span"`, `"slice"`, and `"start"`. These dictate how the overall span, a span slice, and the starting token will be rendered. ~~Optional[Dict[str, str]]~~ |
+| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. ~~Optional[str]~~ |
+| `colors` | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~ |
+
+
+By default, displaCy comes with colors for all entity types used by
+[spaCy's trained pipelines](/models) for both the entity and the span
+visualizer. If you're using custom entity or span types, you can use the
+`colors` setting to add your own colors for them. Your application or pipeline
+package can also expose a
+[`spacy_displacy_colors` entry point](/usage/saving-loading#entry-points-displacy)
+to add custom labels and their colors automatically.
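+
+For example, a small sketch that combines the span visualizer options with a
+color override for a custom `BANK` label (the label and spans key below are
+just illustrations):
+
+```python
+import spacy
+from spacy import displacy
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+doc = nlp("Welcome to the Bank of China.")
+# Store a span with a custom label under a custom spans key
+doc.spans["custom"] = [Span(doc, 3, 6, label="BANK")]
+options = {"spans_key": "custom", "colors": {"BANK": "#7aecec"}}
+displacy.serve(doc, style="span", options=options)
+```
+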
By default, displaCy links to `#` for entities without a `kb_id` set on their
span. If you wish to link an entity to their URL then consider using the
@@ -335,6 +354,7 @@ span. If you wish to link an entity to their URL then consider using the
should redirect you to their Wikidata page, in this case
`https://www.wikidata.org/wiki/Q95`.
+
## registry {#registry source="spacy/util.py" new="3"}
spaCy's function registry extends
@@ -423,7 +443,7 @@ and the accuracy scores on the development set.
The built-in, default logger is the ConsoleLogger, which prints results to the
console in tabular format. The
[spacy-loggers](https://github.com/explosion/spacy-loggers) package, included as
-a dependency of spaCy, enables other loggers: currently it provides one that
+a dependency of spaCy, enables other loggers, such as one that
sends results to a [Weights & Biases](https://www.wandb.com/) dashboard.
Instead of using one of the built-in loggers, you can
diff --git a/website/docs/api/vectors.md b/website/docs/api/vectors.md
index b3bee822c..9636ea04c 100644
--- a/website/docs/api/vectors.md
+++ b/website/docs/api/vectors.md
@@ -327,9 +327,9 @@ will be counted individually. In `floret` mode, the keys table is not used.
> assert vectors.n_keys == 0
> ```
-| Name | Description |
-| ----------- | -------------------------------------------- |
-| **RETURNS** | The number of all keys in the table. ~~int~~ |
+| Name | Description |
+| ----------- | ----------------------------------------------------------------------------- |
+| **RETURNS** | The number of all keys in the table. Returns `-1` for floret vectors. ~~int~~ |
## Vectors.most_similar {#most_similar tag="method"}
@@ -347,14 +347,14 @@ supported for `floret` mode.
> most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
> ```
-| Name | Description |
-| -------------- | --------------------------------------------------------------------------- |
-| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
-| _keyword-only_ | |
-| `batch_size` | The batch size to use. Default to `1024`. ~~int~~ |
-| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
-| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
-| **RETURNS** | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
+| Name | Description |
+| -------------- | ----------------------------------------------------------------------------------------------------------------------- |
+| `queries` | An array with one or more vectors. ~~numpy.ndarray~~ |
+| _keyword-only_ | |
+| `batch_size` | The batch size to use. Default to `1024`. ~~int~~ |
+| `n` | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
+| `sort` | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
+| **RETURNS** | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
## Vectors.get_batch {#get_batch tag="method" new="3.2"}
@@ -385,7 +385,7 @@ Change the embedding matrix to use different Thinc ops.
> ```
| Name | Description |
-|-------|----------------------------------------------------------|
+| ----- | -------------------------------------------------------- |
| `ops` | The Thinc ops to switch the embedding matrix to. ~~Ops~~ |
## Vectors.to_disk {#to_disk tag="method"}
diff --git a/website/docs/api/vocab.md b/website/docs/api/vocab.md
index c0a269d95..2e4a206ec 100644
--- a/website/docs/api/vocab.md
+++ b/website/docs/api/vocab.md
@@ -156,7 +156,7 @@ cosines are calculated in minibatches to reduce memory usage.
>
> ```python
> nlp.vocab.prune_vectors(10000)
-> assert len(nlp.vocab.vectors) <= 1000
+> assert len(nlp.vocab.vectors) <= 10000
> ```
| Name | Description |
@@ -165,26 +165,34 @@ cosines are calculated in minibatches to reduce memory usage.
| `batch_size` | Batch of vectors for calculating the similarities. Larger batch sizes might be faster, while temporarily requiring more memory. ~~int~~ |
| **RETURNS** | A dictionary keyed by removed words mapped to `(string, score)` tuples, where `string` is the entry the removed word was mapped to, and `score` the similarity score between the two words. ~~Dict[str, Tuple[str, float]]~~ |
+## Vocab.deduplicate_vectors {#deduplicate_vectors tag="method" new="3.3"}
+
+> #### Example
+>
+> ```python
+> nlp.vocab.deduplicate_vectors()
+> ```
+
+Remove any duplicate rows from the current vector table, maintaining the
+mappings for all words in the vectors.
+
## Vocab.get_vector {#get_vector tag="method" new="2"}
Retrieve a vector for a word in the vocabulary. Words can be looked up by string
-or hash value. If no vectors data is loaded, a `ValueError` is raised. If `minn`
-is defined, then the resulting vector uses [FastText](https://fasttext.cc/)'s
-subword features by average over n-grams of `orth` (introduced in spaCy `v2.1`).
+or hash value. If the current vectors do not contain an entry for the word, a
+0-vector with the same number of dimensions
+([`Vocab.vectors_length`](#attributes)) as the current vectors is returned.
> #### Example
>
> ```python
> nlp.vocab.get_vector("apple")
-> nlp.vocab.get_vector("apple", minn=1, maxn=5)
> ```
-| Name | Description |
-| ----------------------------------- | ---------------------------------------------------------------------------------------------------------------------- |
-| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
-| `minn` 2.1 | Minimum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
-| `maxn` 2.1 | Maximum n-gram length used for FastText's n-gram computation. Defaults to the length of `orth`. ~~int~~ |
-| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
+| Name | Description |
+| ----------- | ---------------------------------------------------------------------------------------------------------------------- |
+| `orth` | The hash value of a word, or its unicode string. ~~Union[int, str]~~ |
+| **RETURNS** | A word vector. Size and shape are determined by the `Vocab.vectors` instance. ~~numpy.ndarray[ndim=1, dtype=float32]~~ |
## Vocab.set_vector {#set_vector tag="method" new="2"}
diff --git a/website/docs/images/displacy-span-custom.html b/website/docs/images/displacy-span-custom.html
new file mode 100644
index 000000000..97dd3b140
--- /dev/null
+++ b/website/docs/images/displacy-span-custom.html
@@ -0,0 +1,31 @@
+<!-- Rendered example: "Welcome to the Bank of China ." with a custom "BANK" span label -->
\ No newline at end of file
diff --git a/website/docs/images/displacy-span.html b/website/docs/images/displacy-span.html
new file mode 100644
index 000000000..9bbc6403c
--- /dev/null
+++ b/website/docs/images/displacy-span.html
@@ -0,0 +1,41 @@
+<!-- Rendered displaCy "span" visualization: "Welcome to the Bank of China." with overlapping spans ORG over "Bank of China" and GPE over "China" (full markup omitted) -->
\ No newline at end of file
diff --git a/website/docs/images/pipeline-design.svg b/website/docs/images/pipeline-design.svg
index 88ccdab99..3b528eae5 100644
--- a/website/docs/images/pipeline-design.svg
+++ b/website/docs/images/pipeline-design.svg
diff --git a/website/docs/usage/models.md b/website/docs/usage/models.md
--- a/website/docs/usage/models.md
+++ b/website/docs/usage/models.md
+### Korean language support {#korean}
+
+> #### mecab-ko tokenizer
+>
+> ```python
+> nlp = spacy.blank("ko")
+> ```
+
+The default MeCab-based Korean tokenizer requires:
+
+- [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md)
+- [mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic)
+- [natto-py](https://github.com/buruzaemon/natto-py)
+
+For some Korean datasets and tasks, the
+[rule-based tokenizer](/usage/linguistic-features#tokenization) is better-suited
+than MeCab. To configure a Korean pipeline with the rule-based tokenizer:
+
+> #### Rule-based tokenizer
+>
+> ```python
+> config = {"nlp": {"tokenizer": {"@tokenizers": "spacy.Tokenizer.v1"}}}
+> nlp = spacy.blank("ko", config=config)
+> ```
+
+```ini
+### config.cfg
+[nlp]
+lang = "ko"
+tokenizer = {"@tokenizers" = "spacy.Tokenizer.v1"}
+```
+
+
+
+The [Korean trained pipelines](/models/ko) use the rule-based tokenizer, so no
+additional dependencies are required.
+
+
+
## Installing and using trained pipelines {#download}
The easiest way to download a trained pipeline is via spaCy's
@@ -417,10 +463,10 @@ doc = nlp("This is a sentence.")
You can use the [`info`](/api/cli#info) command or
-[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline
-package's meta data before loading it. Each `Language` object with a loaded
-pipeline also exposes the pipeline's meta data as the attribute `meta`. For
-example, `nlp.meta['version']` will return the package version.
+[`spacy.info()`](/api/top-level#spacy.info) method to print a pipeline package's
+meta data before loading it. Each `Language` object with a loaded pipeline also
+exposes the pipeline's meta data as the attribute `meta`. For example,
+`nlp.meta['version']` will return the package version.
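+
+For instance, a minimal sketch of both approaches (the package name is just an
+example and is assumed to be installed):
+
+```python
+import spacy
+
+# Inspect the package meta data without loading the pipeline
+meta = spacy.info("en_core_web_sm")
+
+# Or read the meta data off a loaded pipeline
+nlp = spacy.load("en_core_web_sm")
+print(nlp.meta["version"])
+```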
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index 11fd1459d..4f75b5193 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -303,22 +303,23 @@ available pipeline components and component functions.
> ruler = nlp.add_pipe("entity_ruler")
> ```
-| String name | Component | Description |
-| -------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
-| `tagger` | [`Tagger`](/api/tagger) | Assign part-of-speech-tags. |
-| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
-| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
-| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
-| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
-| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories: exactly one category is predicted per document. |
-| `textcat_multilabel` | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document. |
-| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words. |
-| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
-| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
-| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
-| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
-| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
-| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
+| String name | Component | Description |
+| ---------------------- | ---------------------------------------------------- | ----------------------------------------------------------------------------------------- |
+| `tagger`               | [`Tagger`](/api/tagger)                               | Assign part-of-speech tags.                                                                 |
+| `parser` | [`DependencyParser`](/api/dependencyparser) | Assign dependency labels. |
+| `ner` | [`EntityRecognizer`](/api/entityrecognizer) | Assign named entities. |
+| `entity_linker` | [`EntityLinker`](/api/entitylinker) | Assign knowledge base IDs to named entities. Should be added after the entity recognizer. |
+| `entity_ruler` | [`EntityRuler`](/api/entityruler) | Assign named entities based on pattern rules and dictionaries. |
+| `textcat` | [`TextCategorizer`](/api/textcategorizer) | Assign text categories: exactly one category is predicted per document. |
+| `textcat_multilabel` | [`MultiLabel_TextCategorizer`](/api/textcategorizer) | Assign text categories in a multi-label setting: zero, one or more labels per document. |
+| `lemmatizer` | [`Lemmatizer`](/api/lemmatizer) | Assign base forms to words using rules and lookups. |
+| `trainable_lemmatizer` | [`EditTreeLemmatizer`](/api/edittreelemmatizer) | Assign base forms to words. |
+| `morphologizer` | [`Morphologizer`](/api/morphologizer) | Assign morphological features and coarse-grained POS tags. |
+| `attribute_ruler` | [`AttributeRuler`](/api/attributeruler) | Assign token attribute mappings and rule-based exceptions. |
+| `senter` | [`SentenceRecognizer`](/api/sentencerecognizer) | Assign sentence boundaries. |
+| `sentencizer` | [`Sentencizer`](/api/sentencizer) | Add rule-based sentence segmentation without the dependency parse. |
+| `tok2vec` | [`Tok2Vec`](/api/tok2vec) | Assign token-to-vector embeddings. |
+| `transformer` | [`Transformer`](/api/transformer) | Assign the tokens and outputs of a transformer model. |
### Disabling, excluding and modifying components {#disabling}
@@ -1081,13 +1082,17 @@ on [serialization methods](/usage/saving-loading/#serialization-methods).
> directory.
```python
-### Custom serialization methods {highlight="6-7,9-11"}
+### Custom serialization methods {highlight="7-11,13-15"}
import srsly
+from spacy.util import ensure_path
class AcronymComponent:
# other methods here...
def to_disk(self, path, exclude=tuple()):
+ path = ensure_path(path)
+ if not path.exists():
+ path.mkdir()
srsly.write_json(path / "data.json", self.data)
def from_disk(self, path, exclude=tuple()):
diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md
index e0e787a1d..57d226913 100644
--- a/website/docs/usage/projects.md
+++ b/website/docs/usage/projects.md
@@ -213,6 +213,12 @@ format, train a pipeline, evaluate it and export metrics, package it and spin up
a quick web demo. It looks pretty similar to a config file used to define CI
pipelines.
+> #### Tip: Multi-line YAML syntax for long values
+>
+> YAML has [multi-line syntax](https://yaml-multiline.info/) that can be
+> helpful for readability with longer values such as project descriptions or
+> commands that take several arguments.
+
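+For example, a short sketch of the folded block style (the description and
+command here are made up for illustration):
+
+```yaml
+description: >
+  A long project description can use the folded style, which joins these
+  lines into a single string separated by spaces.
+commands:
+  - name: train
+    script:
+      - >-
+        python -m spacy train configs/config.cfg
+        --output training/
+        --paths.train corpus/train.spacy
+```
+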
```yaml
%%GITHUB_PROJECTS/pipelines/tagger_parser_ud/project.yml
```
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index 74bb10304..be9a56dc8 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -162,6 +162,7 @@ rule-based matching are:
| ----------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `ORTH` | The exact verbatim text of a token. ~~str~~ |
| `TEXT` 2.1 | The exact verbatim text of a token. ~~str~~ |
+| `NORM` | The normalized form of the token text. ~~str~~ |
| `LOWER` | The lowercase form of the token text. ~~str~~ |
| `LENGTH`                                         | The length of the token text. ~~int~~                                                                                                                                                                                                                                                                        |
| `IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`               | Token text consists of alphabetic characters, ASCII characters, digits. ~~bool~~                                                                                                                                                                                                                             |
@@ -948,7 +949,7 @@ for match_id, start, end in matcher(doc):
The examples here use [`nlp.make_doc`](/api/language#make_doc) to create `Doc`
object patterns as efficiently as possible and without running any of the other
-pipeline components. If the token attribute you want to match on are set by a
+pipeline components. If the token attribute you want to match on is set by a
pipeline component, **make sure that the pipeline component runs** when you
create the pattern. For example, to match on `POS` or `LEMMA`, the pattern `Doc`
objects need to have part-of-speech tags set by the `tagger` or `morphologizer`.
@@ -959,9 +960,9 @@ disable components selectively.
Another possible use case is matching number tokens like IP addresses based on
-their shape. This means that you won't have to worry about how those string will
-be tokenized and you'll be able to find tokens and combinations of tokens based
-on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
+their shape. This means that you won't have to worry about how those strings
+will be tokenized and you'll be able to find tokens and combinations of tokens
+based on a few examples. Here, we're matching on the shapes `ddd.d.d.d` and
`ddd.ddd.d.d`:
```python
@@ -1432,7 +1433,7 @@ of `"phrase_matcher_attr": "POS"` for the entity ruler.
Running the full language pipeline across every pattern in a large list scales
linearly and can therefore take a long time on large amounts of phrase patterns.
As of spaCy v2.2.4 the `add_patterns` function has been refactored to use
-nlp.pipe on all phrase patterns resulting in about a 10x-20x speed up with
+`nlp.pipe` on all phrase patterns resulting in about a 10x-20x speed up with
5,000-100,000 phrase patterns respectively. Even with this speedup (but
especially if you're using an older version) the `add_patterns` function can
still take a long time. An easy workaround to make this function run faster is
diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md
index 9dad077e7..af140e7a7 100644
--- a/website/docs/usage/saving-loading.md
+++ b/website/docs/usage/saving-loading.md
@@ -202,7 +202,9 @@ the data to and from a JSON file.
> rules _with_ the component data.
```python
-### {highlight="14-18,20-25"}
+### {highlight="16-23,25-30"}
+from spacy.util import ensure_path
+
@Language.factory("my_component")
class CustomComponent:
def __init__(self):
@@ -218,6 +220,9 @@ class CustomComponent:
def to_disk(self, path, exclude=tuple()):
# This will receive the directory path + /my_component
+ path = ensure_path(path)
+ if not path.exists():
+ path.mkdir()
data_path = path / "data.json"
with data_path.open("w", encoding="utf8") as f:
f.write(json.dumps(self.data))
@@ -467,7 +472,12 @@ pipeline package. When you save out a pipeline using `nlp.to_disk` and the
component exposes a `to_disk` method, it will be called with the disk path.
```python
+from spacy.util import ensure_path
+
def to_disk(self, path, exclude=tuple()):
+ path = ensure_path(path)
+ if not path.exists():
+ path.mkdir()
snek_path = path / "snek.txt"
with snek_path.open("w", encoding="utf8") as snek_file:
snek_file.write(self.snek)
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index f46f0052b..5e064b269 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -247,7 +247,7 @@ a consistent format. There are no command-line arguments that need to be set,
and no hidden defaults. However, there can still be scenarios where you may want
to override config settings when you run [`spacy train`](/api/cli#train). This
includes **file paths** to vectors or other resources that shouldn't be
-hard-code in a config file, or **system-dependent settings**.
+hard-coded in a config file, or **system-dependent settings**.
For cases like this, you can set additional command-line options starting with
`--` that correspond to the config section and value to override. For example,
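`--paths.train ./corpus/train.spacy` overrides the `train` value in the
`[paths]` section of the config (the paths and values in this sketch are
placeholders):

```cli
$ python -m spacy train config.cfg --paths.train ./corpus/train.spacy --paths.dev ./corpus/dev.spacy --training.max_epochs 3
```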
@@ -730,7 +730,7 @@ with the name of the respective [registry](/api/top-level#registry), e.g.
`@spacy.registry.architectures`, and a string name to assign to your function.
Registering custom functions allows you to **plug in models** defined in PyTorch
or TensorFlow, make **custom modifications** to the `nlp` object, create custom
-optimizers or schedules, or **stream in data** and preprocesses it on the fly
+optimizers or schedules, or **stream in data** and preprocess it on the fly
while training.
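
A rough sketch of the registration pattern (the schedule below is illustrative
and not one of spaCy's built-in schedules):

```python
from typing import Iterable

import spacy


@spacy.registry.schedules("my_exponential.v1")
def my_exponential(start: float = 0.005, factor: float = 1.1) -> Iterable[float]:
    # Yield an ever-growing learning rate, starting at `start`
    rate = start
    while True:
        yield rate
        rate *= factor
```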
Each custom function can have any number of arguments that are passed in via the
diff --git a/website/docs/usage/v3-3.md b/website/docs/usage/v3-3.md
new file mode 100644
index 000000000..739e2a2f9
--- /dev/null
+++ b/website/docs/usage/v3-3.md
@@ -0,0 +1,247 @@
+---
+title: What's New in v3.3
+teaser: New features and how to upgrade
+menu:
+ - ['New Features', 'features']
+ - ['Upgrading Notes', 'upgrading']
+---
+
+## New features {#features hidden="true"}
+
+spaCy v3.3 improves the speed of core pipeline components, adds a new trainable
+lemmatizer, and introduces trained pipelines for Finnish, Korean and Swedish.
+
+### Speed improvements {#speed}
+
+v3.3 includes a slew of speed improvements:
+
+- Speed up parser and NER by using constant-time head lookups.
+- Support unnormalized softmax probabilities in `spacy.Tagger.v2` to speed up
+ inference for tagger, morphologizer, senter and trainable lemmatizer.
+- Speed up parser projectivization functions.
+- Replace `Ragged` with faster `AlignmentArray` in `Example` for training.
+- Improve `Matcher` speed.
+- Improve serialization speed for empty `Doc.spans`.
+
+For longer texts, prediction speed for the trained pipelines improves by **15%**
+or more. We benchmarked `en_core_web_md` (same components as in v3.2) and
+`de_core_news_md` (with the new trainable lemmatizer) across a range of text
+sizes on Linux (Intel Xeon W-2265) and OS X (M1) to compare spaCy v3.2 vs. v3.3:
+
+**Intel Xeon W-2265**
+
+| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
+| :----------------------------------------------- | -------------: | -------------: | -------------: | -----: |
+| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 17292 | 17441 | 0.86% |
+| (=same components) | 1000 | 15408 | 16024 | 4.00% |
+| | 10000 | 12798 | 15346 | 19.91% |
+| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 20221 | 19321 | -4.45% |
+| (+v3.3 trainable lemmatizer) | 1000 | 17480 | 17345 | -0.77% |
+| | 10000 | 14513 | 17036 | 17.38% |
+
+**Apple M1**
+
+| Model | Avg. Words/Doc | v3.2 Words/Sec | v3.3 Words/Sec | Diff |
+| ------------------------------------------------ | -------------: | -------------: | -------------: | -----: |
+| [`en_core_web_md`](/models/en#en_core_web_md) | 100 | 18272 | 18408 | 0.74% |
+| (=same components) | 1000 | 18794 | 19248 | 2.42% |
+| | 10000 | 15144 | 17513 | 15.64% |
+| [`de_core_news_md`](/models/de/#de_core_news_md) | 100 | 19227 | 19591 | 1.89% |
+| (+v3.3 trainable lemmatizer) | 1000 | 20047 | 20628 | 2.90% |
+| | 10000 | 15921 | 18546 | 16.49% |
+
+### Trainable lemmatizer {#trainable-lemmatizer}
+
+The new [trainable lemmatizer](/api/edittreelemmatizer) component uses
+[edit trees](https://explosion.ai/blog/edit-tree-lemmatizer) to transform tokens
+into lemmas. Try out the trainable lemmatizer with the
+[training quickstart](/usage/training#quickstart)!
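+
+As a minimal sketch of adding the component to a pipeline (the language and
+component name are arbitrary, and the component still needs training data
+before it produces useful lemmas):
+
+```python
+import spacy
+
+nlp = spacy.blank("de")
+# "trainable_lemmatizer" is the factory name of the EditTreeLemmatizer
+nlp.add_pipe("trainable_lemmatizer", name="lemmatizer")
+```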
+
+### displaCy support for overlapping spans and arcs {#displacy}
+
+displaCy now supports overlapping spans with a new
+[`span`](/usage/visualizers#span) style and multiple arcs with different labels
+between the same tokens for [`dep`](/usage/visualizers#dep) visualizations.
+
+Overlapping spans can be visualized for any spans key in `doc.spans`:
+
+```python
+import spacy
+from spacy import displacy
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+text = "Welcome to the Bank of China."
+doc = nlp(text)
+doc.spans["custom"] = [Span(doc, 3, 6, "ORG"), Span(doc, 5, 6, "GPE")]
+displacy.serve(doc, style="span", options={"spans_key": "custom"})
+```
+
+import DisplacySpanHtml from 'images/displacy-span.html'
+
+
+
+## Additional features and improvements
+
+- Config comparisons with [`spacy debug diff-config`](/api/cli#debug-diff).
+- Span suggester debugging with
+ [`SpanCategorizer.set_candidates`](/api/spancategorizer#set_candidates).
+- Big endian support with
+ [`thinc-bigendian-ops`](https://github.com/andrewsi-z/thinc-bigendian-ops) and
+ updates to make `floret`, `murmurhash`, Thinc and spaCy endian neutral.
+- Initial support for Lower Sorbian and Upper Sorbian.
+- Language updates for English, French, Italian, Japanese, Korean, Norwegian,
+ Russian, Slovenian, Spanish, Turkish, Ukrainian and Vietnamese.
+- New noun chunks for Finnish.
+
+## Trained pipelines {#pipelines}
+
+### New trained pipelines {#new-pipelines}
+
+v3.3 introduces new CPU/CNN pipelines for Finnish, Korean and Swedish, which use
+the new trainable lemmatizer and
+[floret vectors](https://github.com/explosion/floret). Due to the use of
+[Bloom embeddings](https://explosion.ai/blog/bloom-embeddings) and subwords, the
+pipelines have compact vectors with no out-of-vocabulary words.
+
+| Package | Language | UPOS | Parser LAS | NER F |
+| ----------------------------------------------- | -------- | ---: | ---------: | ----: |
+| [`fi_core_news_sm`](/models/fi#fi_core_news_sm) | Finnish | 92.5 | 71.9 | 75.9 |
+| [`fi_core_news_md`](/models/fi#fi_core_news_md) | Finnish | 95.9 | 78.6 | 80.6 |
+| [`fi_core_news_lg`](/models/fi#fi_core_news_lg) | Finnish | 96.2 | 79.4 | 82.4 |
+| [`ko_core_news_sm`](/models/ko#ko_core_news_sm) | Korean | 86.1 | 65.6 | 71.3 |
+| [`ko_core_news_md`](/models/ko#ko_core_news_md) | Korean | 94.7 | 80.9 | 83.1 |
+| [`ko_core_news_lg`](/models/ko#ko_core_news_lg) | Korean | 94.7 | 81.3 | 85.3 |
+| [`sv_core_news_sm`](/models/sv#sv_core_news_sm) | Swedish | 95.0 | 75.9 | 74.7 |
+| [`sv_core_news_md`](/models/sv#sv_core_news_md) | Swedish | 96.3 | 78.5 | 79.3 |
+| [`sv_core_news_lg`](/models/sv#sv_core_news_lg) | Swedish | 96.3 | 79.1 | 81.1 |
+
+### Pipeline updates {#pipeline-updates}
+
+The following languages switch from lookup or rule-based lemmatizers to the new
+trainable lemmatizer: Danish, Dutch, German, Greek, Italian, Lithuanian,
+Norwegian, Polish, Portuguese and Romanian. The overall lemmatizer accuracy
+improves for all of these pipelines, but be aware that the types of errors may
+look quite different from the lookup-based lemmatizers. If you'd prefer to
+continue using the previous lemmatizer, you can
+[switch from the trainable lemmatizer to a non-trainable lemmatizer](/models#design-modify).
+
+
+
+In addition, the vectors in the English pipelines are deduplicated to improve
+the pruned vectors in the `md` models and reduce the `lg` model size.
+
+## Notes about upgrading from v3.2 {#upgrading}
+
+### Span comparisons
+
+Span comparisons involving ordering (`<`, `<=`, `>`, `>=`) now take all span
+attributes into account (start, end, label, and KB ID) so spans may be sorted in
+a slightly different order.
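+
+A small sketch of what this means in practice (the text and labels are
+arbitrary):
+
+```python
+import spacy
+from spacy.tokens import Span
+
+nlp = spacy.blank("en")
+doc = nlp("Welcome to the Bank of China.")
+# Identical boundaries but different labels: sorting is still well-defined,
+# because the label and KB ID now take part in the comparison
+spans = [Span(doc, 3, 6, "ORG"), Span(doc, 3, 6, "BANK")]
+print(sorted(spans))
+```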
+
+### Whitespace annotation
+
+During training, annotation on whitespace tokens is handled in the same way as
+annotation on non-whitespace tokens in order to allow custom whitespace
+annotation.
+
+### Doc.from_docs
+
+[`Doc.from_docs`](/api/doc#from_docs) now includes `Doc.tensor` by default and
+supports excludes with an `exclude` argument in the same format as
+`Doc.to_bytes`. The supported exclude fields are `spans`, `tensor` and
+`user_data`.
+
+Docs including `Doc.tensor` may be quite a bit larger in RAM, so to exclude
+`Doc.tensor` as in v3.2:
+
+```diff
+-merged_doc = Doc.from_docs(docs)
++merged_doc = Doc.from_docs(docs, exclude=["tensor"])
+```
+
+### Using trained pipelines with floret vectors
+
+If you're running a new trained pipeline for Finnish, Korean or Swedish on new
+texts and working with `Doc` objects, you shouldn't notice any difference with
+floret vectors vs. default vectors.
+
+If you use vectors for similarity comparisons, there are a few differences,
+mainly because a floret pipeline doesn't include any kind of frequency-based
+word list similar to the list of in-vocabulary vector keys with default vectors.
+
+- If your workflow iterates over the vector keys, you should use an external
+ word list instead:
+
+ ```diff
+ - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ + lexemes = [nlp.vocab[word] for word in external_word_list]
+ ```
+
+- `Vectors.most_similar` is not supported because there's no fixed list of
+ vectors to compare your vectors to.
+
+### Pipeline package version compatibility {#version-compat}
+
+> #### Using legacy implementations
+>
+> In spaCy v3, you'll still be able to load and reference legacy implementations
+> via [`spacy-legacy`](https://github.com/explosion/spacy-legacy), even if the
+> components or architectures change and newer versions are available in the
+> core library.
+
+When you're loading a pipeline package trained with an earlier version of spaCy
+v3, you will see a warning telling you that the pipeline may be incompatible.
+This doesn't necessarily have to be true, but we recommend running your
+pipelines against your test suite or evaluation data to make sure there are no
+unexpected results.
+
+If you're using one of the [trained pipelines](/models) we provide, you should
+run [`spacy download`](/api/cli#download) to update to the latest version. To
+see an overview of all installed packages and their compatibility, you can run
+[`spacy validate`](/api/cli#validate).
+
+If you've trained your own custom pipeline and you've confirmed that it's still
+working as expected, you can update the spaCy version requirements in the
+[`meta.json`](/api/data-formats#meta):
+
+```diff
+- "spacy_version": ">=3.2.0,<3.3.0",
++ "spacy_version": ">=3.2.0,<3.4.0",
+```
+
+### Updating v3.2 configs
+
+To update a config from spaCy v3.2 with the new v3.3 settings, run
+[`init fill-config`](/api/cli#init-fill-config):
+
+```cli
+$ python -m spacy init fill-config config-v3.2.cfg config-v3.3.cfg
+```
+
+In many cases ([`spacy train`](/api/cli#train),
+[`spacy.load`](/api/top-level#spacy.load)), the new defaults will be filled in
+automatically, but you'll need to fill in the new settings to run
+[`debug config`](/api/cli#debug) and [`debug data`](/api/cli#debug-data).
+
+To see the speed improvements for the
+[`Tagger` architecture](/api/architectures#Tagger), edit your config to switch
+from `spacy.Tagger.v1` to `spacy.Tagger.v2` and then run `init fill-config`.
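+
+As a sketch, the relevant config change looks like the following (assuming the
+component is named `tagger`; morphologizer, senter and trainable lemmatizer
+blocks are updated the same way):
+
+```diff
+[components.tagger.model]
+-@architectures = "spacy.Tagger.v1"
++@architectures = "spacy.Tagger.v2"
+```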
diff --git a/website/docs/usage/visualizers.md b/website/docs/usage/visualizers.md
index 072718f91..d2892b863 100644
--- a/website/docs/usage/visualizers.md
+++ b/website/docs/usage/visualizers.md
@@ -5,6 +5,7 @@ new: 2
menu:
- ['Dependencies', 'dep']
- ['Named Entities', 'ent']
+ - ['Spans', 'span']
- ['Jupyter Notebooks', 'jupyter']
- ['Rendering HTML', 'html']
- ['Web app usage', 'webapp']
@@ -167,6 +168,59 @@ This feature is especially handy if you're using displaCy to compare performance
at different stages of a process, e.g. during training. Here you could use the
title for a brief description of the text example and the number of iterations.
+## Visualizing spans {#span}
+
+The span visualizer, `span`, highlights overlapping spans in a text.
+
+```python
+### Span example
+import spacy
+from spacy import displacy
+from spacy.tokens import Span
+
+text = "Welcome to the Bank of China."
+
+nlp = spacy.blank("en")
+doc = nlp(text)
+
+doc.spans["sc"] = [
+ Span(doc, 3, 6, "ORG"),
+ Span(doc, 5, 6, "GPE"),
+]
+
+displacy.serve(doc, style="span")
+```
+
+import DisplacySpanHtml from 'images/displacy-span.html'
+
+
+
+
+The span visualizer lets you customize the following `options`:
+
+| Argument          | Description                                                                                                                                                                              |
+| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `spans_key`       | Which spans key to render spans from. Default is `"sc"`. ~~str~~                                                                                                                         |
+| `templates`       | Dictionary containing the keys `"span"`, `"slice"` and `"start"`. These dictate how the overall span, a span slice and the starting token will be rendered. ~~Optional[Dict[str, str]]~~ |
+| `kb_url_template` | Optional template to construct the KB url for the entity to link to. Expects a Python f-string format with a single field to fill in. ~~Optional[str]~~                                  |
+| `colors`          | Color overrides. Entity types should be mapped to color names or values. ~~Dict[str, str]~~                                                                                              |
+
+Because spans can be stored across different keys in `doc.spans`, you need to specify
+which one displaCy should use with `spans_key` (`sc` is the default).
+
+> #### Options example
+>
+> ```python
+> doc.spans["custom"] = [Span(doc, 3, 6, "BANK")]
+> options = {"spans_key": "custom"}
+> displacy.serve(doc, style="span", options=options)
+> ```
+
+import DisplacySpanCustomHtml from 'images/displacy-span-custom.html'
+
+
+
+
+
## Using displaCy in Jupyter notebooks {#jupyter}
displaCy is able to detect whether you're working in a
@@ -289,9 +343,7 @@ want to visualize output from other libraries, like [NLTK](http://www.nltk.org)
or
[SyntaxNet](https://github.com/tensorflow/models/tree/master/research/syntaxnet).
If you set `manual=True` on either `render()` or `serve()`, you can pass in data
-in displaCy's format (instead of `Doc` objects). When setting `ents` manually,
-make sure to supply them in the right order, i.e. starting with the lowest start
-position.
+in displaCy's format as a dictionary (instead of `Doc` objects).
> #### Example
>
diff --git a/website/meta/languages.json b/website/meta/languages.json
index a7dda6482..64ca7a082 100644
--- a/website/meta/languages.json
+++ b/website/meta/languages.json
@@ -62,6 +62,11 @@
"example": "Dies ist ein Satz.",
"has_examples": true
},
+ {
+ "code": "dsb",
+ "name": "Lower Sorbian",
+ "has_examples": true
+ },
{
"code": "el",
"name": "Greek",
@@ -114,7 +119,12 @@
{
"code": "fi",
"name": "Finnish",
- "has_examples": true
+ "has_examples": true,
+ "models": [
+ "fi_core_news_sm",
+ "fi_core_news_md",
+ "fi_core_news_lg"
+ ]
},
{
"code": "fr",
@@ -154,6 +164,11 @@
"name": "Croatian",
"has_examples": true
},
+ {
+ "code": "hsb",
+ "name": "Upper Sorbian",
+ "has_examples": true
+ },
{
"code": "hu",
"name": "Hungarian",
@@ -227,7 +242,12 @@
}
],
"example": "์ด๊ฒ์ ๋ฌธ์ฅ์ ๋๋ค.",
- "has_examples": true
+ "has_examples": true,
+ "models": [
+ "ko_core_news_sm",
+ "ko_core_news_md",
+ "ko_core_news_lg"
+ ]
},
{
"code": "ky",
@@ -388,7 +408,12 @@
{
"code": "sv",
"name": "Swedish",
- "has_examples": true
+ "has_examples": true,
+ "models": [
+ "sv_core_news_sm",
+ "sv_core_news_md",
+ "sv_core_news_lg"
+ ]
},
{
"code": "ta",
diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json
index 1054f7626..cf3f1398e 100644
--- a/website/meta/sidebars.json
+++ b/website/meta/sidebars.json
@@ -11,7 +11,8 @@
{ "text": "spaCy 101", "url": "/usage/spacy-101" },
{ "text": "New in v3.0", "url": "/usage/v3" },
{ "text": "New in v3.1", "url": "/usage/v3-1" },
- { "text": "New in v3.2", "url": "/usage/v3-2" }
+ { "text": "New in v3.2", "url": "/usage/v3-2" },
+ { "text": "New in v3.3", "url": "/usage/v3-3" }
]
},
{
@@ -40,7 +41,11 @@
"label": "Resources",
"items": [
{ "text": "Project Templates", "url": "https://github.com/explosion/projects" },
- { "text": "v2.x Documentation", "url": "https://v2.spacy.io" }
+ { "text": "v2.x Documentation", "url": "https://v2.spacy.io" },
+ {
+ "text": "Custom Solutions",
+ "url": "https://explosion.ai/spacy-tailored-pipelines"
+ }
]
}
]
@@ -89,6 +94,7 @@
"items": [
{ "text": "AttributeRuler", "url": "/api/attributeruler" },
{ "text": "DependencyParser", "url": "/api/dependencyparser" },
+ { "text": "EditTreeLemmatizer", "url": "/api/edittreelemmatizer" },
{ "text": "EntityLinker", "url": "/api/entitylinker" },
{ "text": "EntityRecognizer", "url": "/api/entityrecognizer" },
{ "text": "EntityRuler", "url": "/api/entityruler" },
diff --git a/website/meta/site.json b/website/meta/site.json
index 169680f86..97051011f 100644
--- a/website/meta/site.json
+++ b/website/meta/site.json
@@ -19,7 +19,7 @@
"newsletter": {
"user": "spacy.us12",
"id": "83b0498b1e7fa3c91ce68c3f1",
- "list": "89ad33e698"
+ "list": "ecc82e0493"
},
"docSearch": {
"appId": "Y1LB128RON",
@@ -48,7 +48,11 @@
{ "text": "Usage", "url": "/usage" },
{ "text": "Models", "url": "/models" },
{ "text": "API Reference", "url": "/api" },
- { "text": "Online Course", "url": "https://course.spacy.io" }
+ { "text": "Online Course", "url": "https://course.spacy.io" },
+ {
+ "text": "Custom Solutions",
+ "url": "https://explosion.ai/spacy-tailored-pipelines"
+ }
]
},
{
diff --git a/website/meta/universe.json b/website/meta/universe.json
index b1a61598e..e37c918ca 100644
--- a/website/meta/universe.json
+++ b/website/meta/universe.json
@@ -1,5 +1,69 @@
{
"resources": [
+ {
+ "id": "scrubadub_spacy",
+ "title": "scrubadub_spacy",
+ "category": ["pipeline"],
+ "slogan": "Remove personally identifiable information from text using spaCy.",
+ "description": "scrubadub removes personally identifiable information from text. scrubadub_spacy is an extension that uses spaCy NLP models to remove personal information from text.",
+ "github": "LeapBeyond/scrubadub_spacy",
+ "pip": "scrubadub-spacy",
+ "url": "https://github.com/LeapBeyond/scrubadub_spacy",
+ "code_language": "python",
+ "author": "Leap Beyond",
+ "author_links": {
+ "github": "https://github.com/LeapBeyond",
+ "website": "https://leapbeyond.ai"
+ },
+ "code_example": [
+ "import scrubadub, scrubadub_spacy",
+ "scrubber = scrubadub.Scrubber()",
+ "scrubber.add_detector(scrubadub_spacy.detectors.SpacyEntityDetector)",
+ "print(scrubber.clean(\"My name is Alex, I work at LifeGuard in London, and my eMail is alex@lifeguard.com btw. my super secret twitter login is username: alex_2000 password: g-dragon180888\"))",
+ "# My name is {{NAME}}, I work at {{ORGANIZATION}} in {{LOCATION}}, and my eMail is {{EMAIL}} btw. my super secret twitter login is username: {{USERNAME}} password: {{PASSWORD}}"
+ ]
+ },
+ {
+ "id": "spacy-setfit-textcat",
+ "title": "spacy-setfit-textcat",
+ "category": ["research"],
+ "tags": ["SetFit", "Few-Shot"],
+ "slogan": "spaCy Project: Experiments with SetFit & Few-Shot Classification",
+ "description": "This project is an experiment with spaCy and few-shot text classification using SetFit.",
+ "github": "pmbaumgartner/spacy-setfit-textcat",
+ "url": "https://github.com/pmbaumgartner/spacy-setfit-textcat",
+ "code_language": "python",
+ "author": "Peter Baumgartner",
+ "author_links": {
+ "twitter" : "https://twitter.com/pmbaumgartner",
+ "github": "https://github.com/pmbaumgartner",
+ "website": "https://www.peterbaumgartner.com/"
+ },
+ "code_example": [
+ "https://colab.research.google.com/drive/1CvGEZC0I9_v8gWrBxSJQ4Z8JGPJz-HYb?usp=sharing"
+ ]
+ },
+ {
+ "id": "spacy-experimental",
+ "title": "spacy-experimental",
+ "category": ["extension"],
+ "slogan": "Cutting-edge experimental spaCy components and features",
+ "description": "This package includes experimental components and features for spaCy v3.x, for example model architectures, pipeline components and utilities.",
+ "github": "explosion/spacy-experimental",
+ "pip": "spacy-experimental",
+ "url": "https://github.com/explosion/spacy-experimental",
+ "code_language": "python",
+ "author": "Explosion",
+ "author_links": {
+ "twitter" : "https://twitter.com/explosion_ai",
+ "github": "https://github.com/explosion",
+ "website": "https://explosion.ai/"
+ },
+ "code_example": [
+ "python -m pip install -U pip setuptools wheel",
+ "python -m pip install spacy-experimental"
+ ]
+ },
{
"id": "spacypdfreader",
"title": "spadypdfreader",
@@ -227,11 +291,11 @@
},
{
"id": "spacy-textblob",
- "title": "spaCyTextBlob",
- "slogan": "Easy sentiment analysis for spaCy using TextBlob. Now supports spaCy 3.0!",
- "thumb": "https://github.com/SamEdwardes/spaCyTextBlob/raw/main/website/static/img/logo-thumb-square-250x250.png",
- "description": "spaCyTextBlob is a pipeline component that enables sentiment analysis using the [TextBlob](https://github.com/sloria/TextBlob) library. It will add the additional extensions `._.polarity`, `._.subjectivity`, and `._.assessments` to `Doc`, `Span`, and `Token` objects. For spaCy 2 please use `pip install pip install spacytextblob==0.1.7`",
- "github": "SamEdwardes/spaCyTextBlob",
+ "title": "spacytextblob",
+ "slogan": "A TextBlob sentiment analysis pipeline component for spaCy.",
+ "thumb": "https://github.com/SamEdwardes/spacytextblob/raw/main/docs/static/img/logo-thumb-square-250x250.png",
+ "description": "spacytextblob is a pipeline component that enables sentiment analysis using the [TextBlob](https://github.com/sloria/TextBlob) library. It will add the additional extension `._.blob` to `Doc`, `Span`, and `Token` objects.",
+ "github": "SamEdwardes/spacytextblob",
"pip": "spacytextblob",
"code_example": [
"import spacy",
@@ -241,9 +305,10 @@
"nlp.add_pipe('spacytextblob')",
"text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'",
"doc = nlp(text)",
- "doc._.polarity # Polarity: -0.125",
- "doc._.subjectivity # Sujectivity: 0.9",
- "doc._.assessments # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]"
+ "doc._.blob.polarity # Polarity: -0.125",
+ "doc._.blob.subjectivity # Subjectivity: 0.9",
+ "doc._.blob.sentiment_assessments.assessments # Assessments: [(['really', 'horrible'], -1.0, 1.0, None), (['worst', '!'], -1.0, 1.0, None), (['really', 'good'], 0.7, 0.6000000000000001, None), (['happy'], 0.8, 1.0, None)]",
+ "doc._.blob.ngrams() # [WordList(['I', 'had', 'a']), WordList(['had', 'a', 'really']), WordList(['a', 'really', 'horrible']), WordList(['really', 'horrible', 'day']), WordList(['horrible', 'day', 'It']), WordList(['day', 'It', 'was']), WordList(['It', 'was', 'the']), WordList(['was', 'the', 'worst']), WordList(['the', 'worst', 'day']), WordList(['worst', 'day', 'ever']), WordList(['day', 'ever', 'But']), WordList(['ever', 'But', 'every']), WordList(['But', 'every', 'now']), WordList(['every', 'now', 'and']), WordList(['now', 'and', 'then']), WordList(['and', 'then', 'I']), WordList(['then', 'I', 'have']), WordList(['I', 'have', 'a']), WordList(['have', 'a', 'really']), WordList(['a', 'really', 'good']), WordList(['really', 'good', 'day']), WordList(['good', 'day', 'that']), WordList(['day', 'that', 'makes']), WordList(['that', 'makes', 'me']), WordList(['makes', 'me', 'happy'])]"
],
"code_language": "python",
"url": "https://spacytextblob.netlify.app/",
@@ -254,7 +319,8 @@
"website": "https://samedwardes.com"
},
"category": ["pipeline"],
- "tags": ["sentiment", "textblob"]
+ "tags": ["sentiment", "textblob"],
+ "spacy_version": 3
},
{
"id": "spacy-ray",
@@ -325,15 +391,20 @@
"pip": "spaczz",
"code_example": [
"import spacy",
- "from spaczz.pipeline import SpaczzRuler",
+ "from spaczz.matcher import FuzzyMatcher",
"",
- "nlp = spacy.blank('en')",
- "ruler = SpaczzRuler(nlp)",
- "ruler.add_patterns([{'label': 'PERSON', 'pattern': 'Bill Gates', 'type': 'fuzzy'}])",
- "nlp.add_pipe(ruler)",
+ "nlp = spacy.blank(\"en\")",
+ "text = \"\"\"Grint Anderson created spaczz in his home at 555 Fake St,",
+ "Apt 5 in Nashv1le, TN 55555-1234 in the US.\"\"\" # Spelling errors intentional.",
+ "doc = nlp(text)",
"",
- "doc = nlp('Oops, I spelled Bill Gatez wrong.')",
- "print([(ent.text, ent.start, ent.end, ent.label_) for ent in doc.ents])"
+ "matcher = FuzzyMatcher(nlp.vocab)",
+ "matcher.add(\"NAME\", [nlp(\"Grant Andersen\")])",
+ "matcher.add(\"GPE\", [nlp(\"Nashville\")])",
+ "matches = matcher(doc)",
+ "",
+ "for match_id, start, end, ratio in matches:",
+ " print(match_id, doc[start:end], ratio)"
],
"code_language": "python",
"url": "https://spaczz.readthedocs.io/en/latest/",
@@ -375,10 +446,10 @@
"title": "whatlies",
"slogan": "Make interactive visualisations to figure out 'what lies' in word embeddings.",
"description": "This small library offers tools to make visualisation easier of both word embeddings as well as operations on them. It has support for spaCy prebuilt models as a first class citizen but also offers support for sense2vec. There's a convenient API to perform linear algebra as well as support for popular transformations like PCA/UMAP/etc.",
- "github": "rasahq/whatlies",
+ "github": "koaning/whatlies",
"pip": "whatlies",
"thumb": "https://i.imgur.com/rOkOiLv.png",
- "image": "https://raw.githubusercontent.com/RasaHQ/whatlies/master/docs/gif-two.gif",
+ "image": "https://raw.githubusercontent.com/koaning/whatlies/master/docs/gif-two.gif",
"code_example": [
"from whatlies import EmbeddingSet",
"from whatlies.language import SpacyLanguage",
@@ -440,6 +511,84 @@
"website": "https://koaning.io"
}
},
+ {
+ "id": "Klayers",
+ "title": "Klayers",
+ "category": ["pipeline"],
+ "tags": ["AWS"],
+ "slogan": "spaCy as an AWS Lambda Layer",
+ "description": "A collection of Python packages as AWS Lambda (λ) Layers",
+ "github": "keithrozario/Klayers",
+ "pip": "",
+ "url": "https://github.com/keithrozario/Klayers",
+ "code_language": "python",
+ "author": "Keith Rozario",
+ "author_links": {
+ "twitter" : "https://twitter.com/keithrozario",
+ "github": "https://github.com/keithrozario",
+ "website": "https://www.keithrozario.com"
+ },
+ "code_example": [
+ "# SAM Template",
+ "MyLambdaFunction:",
+ " Type: AWS::Serverless::Function",
+ " Handler: 02_pipeline/spaCy.main",
+ " Description: Named Entity Extraction",
+ " Runtime: python3.8",
+ " Layers:",
+ " - arn:aws:lambda:${self:provider.region}:113088814899:layer:Klayers-python37-spacy:18"
+ ]
+ },
+ {
+ "type": "education",
+ "id": "video-spacys-ner-model-alt",
+ "title": "Named Entity Recognition (NER) using spaCy",
+ "slogan": "",
+ "description": "In this video, I show you how to do named entity recognition using the spaCy library for Python.",
+ "youtube": "Gn_PjruUtrc",
+ "author": "Applied Language Technology",
+ "author_links": {
+ "twitter": "HelsinkiNLP",
+ "github": "Applied-Language-Technology",
+ "website": "https://applied-language-technology.mooc.fi/"
+ },
+ "category": ["videos"]
+ },
+ {
+ "id": "HuSpaCy",
+ "title": "HuSpaCy",
+ "category": ["models"],
+ "tags": ["Hungarian"],
+ "slogan": "HuSpaCy: industrial-strength Hungarian natural language processing",
+ "description": "HuSpaCy is a spaCy model and a library providing industrial-strength Hungarian language processing facilities.",
+ "github": "huspacy/huspacy",
+ "pip": "huspacy",
+ "url": "https://github.com/huspacy/huspacy",
+ "code_language": "python",
+ "author": "SzegedAI",
+ "author_links": {
+ "github": "https://szegedai.github.io/",
+ "website": "https://u-szeged.hu/english"
+ },
+ "code_example": [
+ "# Load the model using huspacy",
+ "import huspacy",
+ "",
+ "nlp = huspacy.load()",
+ "",
+ "# Load the model using spacy.load()",
+ "import spacy",
+ "",
+ "nlp = spacy.load(\"hu_core_news_lg\")",
+ "",
+ "# Load the model directly as a module",
+ "import hu_core_news_lg",
+ "",
+ "nlp = hu_core_news_lg.load()\n",
+ "# Either way you get the same model and can start processing texts.",
+ "doc = nlp(\"Csiribiri csiribiri zabszalma - négy csillag közt alszom ma.\")"
+ ]
+ },
{
"id": "spacy-stanza",
"title": "spacy-stanza",
@@ -589,23 +738,6 @@
"category": ["conversational", "standalone"],
"tags": ["chatbots"]
},
- {
- "id": "saber",
- "title": "saber",
- "slogan": "Deep-learning based tool for information extraction in the biomedical domain",
- "github": "BaderLab/saber",
- "pip": "saber",
- "thumb": "https://raw.githubusercontent.com/BaderLab/saber/master/docs/img/saber_logo.png",
- "code_example": [
- "from saber.saber import Saber",
- "saber = Saber()",
- "saber.load('PRGE')",
- "saber.annotate('The phosphorylation of Hdm2 by MK2 promotes the ubiquitination of p53.')"
- ],
- "author": "Bader Lab, University of Toronto",
- "category": ["scientific"],
- "tags": ["keras", "biomedical"]
- },
{
"id": "alibi",
"title": "alibi",
@@ -635,18 +767,17 @@
"import spacy",
"from spacymoji import Emoji",
"",
- "nlp = spacy.load('en')",
- "emoji = Emoji(nlp)",
- "nlp.add_pipe(emoji, first=True)",
+ "nlp = spacy.load(\"en_core_web_sm\")",
+ "nlp.add_pipe(\"emoji\", first=True)",
+ "doc = nlp(\"This is a test 😻 👍🏿\")",
"",
- "doc = nlp('This is a test 😻 👍🏿')",
- "assert doc._.has_emoji == True",
- "assert doc[2:5]._.has_emoji == True",
- "assert doc[0]._.is_emoji == False",
- "assert doc[4]._.is_emoji == True",
- "assert doc[5]._.emoji_desc == 'thumbs up dark skin tone'",
+ "assert doc._.has_emoji is True",
+ "assert doc[2:5]._.has_emoji is True",
+ "assert doc[0]._.is_emoji is False",
+ "assert doc[4]._.is_emoji is True",
+ "assert doc[5]._.emoji_desc == \"thumbs up dark skin tone\"",
"assert len(doc._.emoji) == 2",
- "assert doc._.emoji[1] == ('👍🏿', 5, 'thumbs up dark skin tone')"
+ "assert doc._.emoji[1] == (\"👍🏿\", 5, \"thumbs up dark skin tone\")"
],
"author": "Ines Montani",
"author_links": {
@@ -883,9 +1014,8 @@
"import spacy",
"from spacy_sentiws import spaCySentiWS",
"",
- "nlp = spacy.load('de')",
- "sentiws = spaCySentiWS(sentiws_path='data/sentiws/')",
- "nlp.add_pipe(sentiws)",
+ "nlp = spacy.load('de_core_news_sm')",
+ "nlp.add_pipe('sentiws', config={'sentiws_path': 'data/sentiws'})",
"doc = nlp('Die Dummheit der Unterwerfung blรผht in hรผbschen Farben.')",
"",
"for token in doc:",
@@ -953,6 +1083,37 @@
"category": ["pipeline"],
"tags": ["lemmatizer", "danish"]
},
+ {
+ "id": "augmenty",
+ "title": "Augmenty",
+ "slogan": "The cherry on top of your NLP pipeline",
+ "description": "Augmenty is an augmentation library based on spaCy for augmenting texts. Augmenty differs from other augmentation libraries in that it corrects (as far as possible) the token, sentence and document labels under the augmentation.",
+ "github": "kennethenevoldsen/augmenty",
+ "pip": "augmenty",
+ "code_example": [
+ "import spacy",
+ "import augmenty",
+ "",
+ "nlp = spacy.load('en_core_web_md')",
+ "",
+ "docs = nlp.pipe(['Augmenty is a great tool for text augmentation'])",
+ "",
+ "ent_dict = {'ORG': [['spaCy'], ['spaCy', 'Universe']]}",
+ "entity_augmenter = augmenty.load('ents_replace.v1',",
+ " ent_dict = ent_dict, level=1)",
+ "",
+ "for doc in augmenty.docs(docs, augmenter=entity_augmenter, nlp=nlp):",
+ " print(doc)"
+ ],
+ "thumb": "https://github.com/KennethEnevoldsen/augmenty/blob/master/img/icon.png?raw=true",
+ "author": "Kenneth Enevoldsen",
+ "author_links": {
+ "github": "kennethenevoldsen",
+ "website": "https://www.kennethenevoldsen.com"
+ },
+ "category": ["training", "research"],
+ "tags": ["training", "research", "augmentation"]
+ },
{
"id": "dacy",
"title": "DaCy",
@@ -1043,29 +1204,6 @@
"category": ["pipeline"],
"tags": ["pipeline", "readability", "syntactic complexity", "descriptive statistics"]
},
- {
- "id": "wmd-relax",
- "slogan": "Calculates word mover's distance insanely fast",
- "description": "Calculates Word Mover's Distance as described in [From Word Embeddings To Document Distances](http://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf) by Matt Kusner, Yu Sun, Nicholas Kolkin and Kilian Weinberger.\n\nโ ๏ธ **This package is currently only compatible with spaCy v.1x.**",
- "github": "src-d/wmd-relax",
- "thumb": "https://i.imgur.com/f91C3Lf.jpg",
- "code_example": [
- "import spacy",
- "import wmd",
- "",
- "nlp = spacy.load('en', create_pipeline=wmd.WMD.create_spacy_pipeline)",
- "doc1 = nlp(\"Politician speaks to the media in Illinois.\")",
- "doc2 = nlp(\"The president greets the press in Chicago.\")",
- "print(doc1.similarity(doc2))"
- ],
- "author": "source{d}",
- "author_links": {
- "github": "src-d",
- "twitter": "sourcedtech",
- "website": "https://sourced.tech"
- },
- "category": ["pipeline"]
- },
{
"id": "neuralcoref",
"slogan": "State-of-the-art coreference resolution based on neural nets and spaCy",
@@ -1492,17 +1630,6 @@
},
"category": ["nonpython"]
},
- {
- "id": "spaCy.jl",
- "slogan": "Julia interface for spaCy (work in progress)",
- "github": "jekbradbury/SpaCy.jl",
- "author": "James Bradbury",
- "author_links": {
- "github": "jekbradbury",
- "twitter": "jekbradbury"
- },
- "category": ["nonpython"]
- },
{
"id": "ruby-spacy",
"title": "ruby-spacy",
@@ -1572,21 +1699,6 @@
},
"category": ["apis"]
},
- {
- "id": "languagecrunch",
- "slogan": "NLP server for spaCy, WordNet and NeuralCoref as a Docker image",
- "github": "artpar/languagecrunch",
- "code_example": [
- "docker run -it -p 8080:8080 artpar/languagecrunch",
- "curl http://localhost:8080/nlp/parse?`echo -n \"The new twitter is so weird. Seriously. Why is there a new twitter? What was wrong with the old one? Fix it now.\" | python -c \"import urllib, sys; print(urllib.urlencode({'sentence': sys.stdin.read()}))\"`"
- ],
- "code_language": "bash",
- "author": "Parth Mudgal",
- "author_links": {
- "github": "artpar"
- },
- "category": ["apis"]
- },
{
"id": "spacy-nlp",
"slogan": " Expose spaCy NLP text parsing to Node.js (and other languages) via Socket.IO",
@@ -1975,6 +2087,20 @@
"youtube": "f4sqeLRzkPg",
"category": ["videos"]
},
+ {
+ "type": "education",
+ "id": "video-intro-to-nlp-episode-6",
+ "title": "Intro to NLP with spaCy (6)",
+ "slogan": "Episode 6: Moving to spaCy v3",
+ "description": "In this new video series, data science instructor Vincent Warmerdam gets started with spaCy, an open-source library for Natural Language Processing in Python. His mission: building a system to automatically detect programming languages in large volumes of text. Follow his process from the first idea to a prototype all the way to data collection and training a statistical named entity recognition model from scratch.",
+ "author": "Vincent Warmerdam",
+ "author_links": {
+ "twitter": "fishnets88",
+ "github": "koaning"
+ },
+ "youtube": "k77RrmMaKEI",
+ "category": ["videos"]
+ },
{
"type": "education",
"id": "video-spacy-irl-entity-linking",
@@ -2161,43 +2287,6 @@
"category": ["standalone"],
"tags": ["question-answering", "elasticsearch"]
},
- {
- "id": "epitator",
- "title": "EpiTator",
- "thumb": "https://i.imgur.com/NYFY1Km.jpg",
- "slogan": "Extracts case counts, resolved location/species/disease names, date ranges and more",
- "description": "EcoHealth Alliance uses EpiTator to catalog the what, where and when of infectious disease case counts reported in online news. Each of these aspects is extracted using independent annotators than can be applied to other domains. EpiTator organizes annotations by creating \"AnnoTiers\" for each type. AnnoTiers have methods for manipulating, combining and searching annotations. For instance, the `with_following_spans_from()` method can be used to create a new tier that combines a tier of one type (such as numbers), with another (say, kitchenware). The resulting tier will contain all the phrases in the document that match that pattern, like \"5 plates\" or \"2 cups.\"\n\nAnother commonly used method is `group_spans_by_containing_span()` which can be used to do things like find all the spaCy tokens in all the GeoNames a document mentions. spaCy tokens, named entities, sentences and noun chunks are exposed through the spaCy annotator which will create a AnnoTier for each. These are basis of many of the other annotators. EpiTator also includes an annotator for extracting tables embedded in free text articles. Another neat feature is that the lexicons used for entity resolution are all stored in an embedded sqlite database so there is no need to run any external services in order to use EpiTator.",
- "url": "https://github.com/ecohealthalliance/EpiTator",
- "github": "ecohealthalliance/EpiTator",
- "pip": "EpiTator",
- "code_example": [
- "from epitator.annotator import AnnoDoc",
- "from epitator.geoname_annotator import GeonameAnnotator",
- "",
- "doc = AnnoDoc('Where is Chiang Mai?')",
- "geoname_annotier = doc.require_tiers('geonames', via=GeonameAnnotator)",
- "geoname = geoname_annotier.spans[0].metadata['geoname']",
- "geoname['name']",
- "# = 'Chiang Mai'",
- "geoname['geonameid']",
- "# = '1153671'",
- "geoname['latitude']",
- "# = 18.79038",
- "geoname['longitude']",
- "# = 98.98468",
- "",
- "from epitator.spacy_annotator import SpacyAnnotator",
- "spacy_token_tier = doc.require_tiers('spacy.tokens', via=SpacyAnnotator)",
- "list(geoname_annotier.group_spans_by_containing_span(spacy_token_tier))",
- "# = [(AnnoSpan(9-19, Chiang Mai), [AnnoSpan(9-15, Chiang), AnnoSpan(16-19, Mai)])]"
- ],
- "author": "EcoHealth Alliance",
- "author_links": {
- "github": "ecohealthalliance",
- "website": " https://ecohealthalliance.org/"
- },
- "category": ["scientific", "standalone"]
- },
{
"id": "self-attentive-parser",
"title": "Berkeley Neural Parser",
@@ -2226,30 +2315,6 @@
},
"category": ["research", "pipeline"]
},
- {
- "id": "excelcy",
- "title": "ExcelCy",
- "slogan": "Excel Integration with spaCy. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG.",
- "description": "ExcelCy is a toolkit to integrate Excel to spaCy NLP training experiences. Training NER using XLSX from PDF, DOCX, PPT, PNG or JPG. ExcelCy has pipeline to match Entity with PhraseMatcher or Matcher in regular expression.",
- "url": "https://github.com/kororo/excelcy",
- "github": "kororo/excelcy",
- "pip": "excelcy",
- "code_example": [
- "from excelcy import ExcelCy",
- "# collect sentences, annotate Entities and train NER using spaCy",
- "excelcy = ExcelCy.execute(file_path='https://github.com/kororo/excelcy/raw/master/tests/data/test_data_01.xlsx')",
- "# use the nlp object as per spaCy API",
- "doc = excelcy.nlp('Google rebrands its business apps')",
- "# or save it for faster bootstrap for application",
- "excelcy.nlp.to_disk('/model')"
- ],
- "author": "Robertus Johansyah",
- "author_links": {
- "github": "kororo"
- },
- "category": ["training"],
- "tags": ["excel"]
- },
{
"id": "spacy-graphql",
"title": "spacy-graphql",
@@ -2372,18 +2437,17 @@
{
"id": "spacy-conll",
"title": "spacy_conll",
- "slogan": "Parsing to CoNLL with spaCy, spacy-stanza, and spacy-udpipe",
- "description": "This module allows you to parse text into CoNLL-U format. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a spaCy, spacy-stanfordnlp, spacy-stanza, or spacy-udpipe pipeline. It also provides an easy-to-use function to quickly initialize a parser. CoNLL-related properties are added to Doc elements, sentence Spans, and Tokens.",
+ "slogan": "Parsing from and to CoNLL-U format with `spacy`, `spacy-stanza` and `spacy-udpipe`",
+ "description": "This module allows you to parse text into CoNLL-U format or read ConLL-U into a spaCy `Doc`. You can use it as a command line tool, or embed it in your own scripts by adding it as a custom pipeline component to a `spacy`, `spacy-stanza` or `spacy-udpipe` pipeline. It also provides an easy-to-use function to quickly initialize any spaCy-wrapped parser. CoNLL-related properties are added to `Doc` elements, `Span` sentences, and `Token` objects.",
"code_example": [
"from spacy_conll import init_parser",
"",
"",
"# Initialise English parser, already including the ConllFormatter as a pipeline component.",
"# Indicate that we want to get the CoNLL headers in the string output.",
- "# `use_gpu` and `verbose` are specific to stanza (and stanfordnlp). These keywords arguments",
- "# are passed onto their Pipeline() initialisation",
- "nlp = init_parser(\"stanza\",",
- " \"en\",",
+ "# \`use_gpu\` and \`verbose\` are specific to stanza. These keyword arguments are passed on to their Pipeline() initialisation",
+ "nlp = init_parser(\"en\",",
+ " \"stanza\",",
" parser_opts={\"use_gpu\": True, \"verbose\": False},",
" include_headers=True)",
"# Parse a given string",
@@ -2402,7 +2466,7 @@
},
"github": "BramVanroy/spacy_conll",
"category": ["standalone", "pipeline"],
- "tags": ["linguistics", "computational linguistics", "conll"]
+ "tags": ["linguistics", "computational linguistics", "conll", "conll-u"]
},
{
"id": "spacy-langdetect",
@@ -2464,41 +2528,6 @@
},
"category": ["standalone", "conversational"]
},
- {
- "id": "gracyql",
- "title": "gracyql",
- "slogan": "A thin GraphQL wrapper around spacy",
- "github": "oterrier/gracyql",
- "description": "An example of a basic [Starlette](https://github.com/encode/starlette) app using [Spacy](https://github.com/explosion/spaCy) and [Graphene](https://github.com/graphql-python/graphene). The main goal is to be able to use the amazing power of spaCy from other languages and retrieving only the information you need thanks to the GraphQL query definition. The GraphQL schema tries to mimic as much as possible the original Spacy API with classes Doc, Span and Token.",
- "thumb": "https://i.imgur.com/xC7zpTO.png",
- "category": ["apis"],
- "tags": ["graphql"],
- "code_example": [
- "query ParserDisabledQuery {",
- " nlp(model: \"en\", disable: [\"parser\", \"ner\"]) {",
- " doc(text: \"I live in Grenoble, France\") {",
- " text",
- " tokens {",
- " id",
- " pos",
- " lemma",
- " dep",
- " }",
- " ents {",
- " start",
- " end",
- " label",
- " }",
- " }",
- " }",
- "}"
- ],
- "code_language": "json",
- "author": "Olivier Terrier",
- "author_links": {
- "github": "oterrier"
- }
- },
{
"id": "pyInflect",
"slogan": "A Python module for word inflections",
@@ -2566,6 +2595,172 @@
},
"category": ["pipeline"]
},
+ {
+ "id": "classyclassification",
+ "title": "Classy Classification",
+ "slogan": "Have you ever struggled with needing a spaCy TextCategorizer but didn't have the time to train one from scratch? Classy Classification is the way to go!",
+ "description": "Have you ever struggled with needing a [spaCy TextCategorizer](https://spacy.io/api/textcategorizer) but didn't have the time to train one from scratch? Classy Classification is the way to go! For few-shot classification using [sentence-transformers](https://github.com/UKPLab/sentence-transformers) or [spaCy models](https://spacy.io/usage/models), provide a dictionary with labels and examples, or just provide a list of labels for zero-shot classification with [Huggingface zero-shot classifiers](https://huggingface.co/models?pipeline_tag=zero-shot-classification).",
+ "github": "davidberenstein1957/classy-classification",
+ "pip": "classy-classification",
+ "thumb": "https://raw.githubusercontent.com/Pandora-Intelligence/classy-classification/master/logo.png",
+ "code_example": [
+ "import spacy",
+ "import classy_classification",
+ "",
+ "data = {",
+ " \"furniture\": [\"This text is about chairs.\",",
+ " \"Couches, benches and televisions.\",",
+ " \"I really need to get a new sofa.\"],",
+ " \"kitchen\": [\"There also exist things like fridges.\",",
+ " \"I hope to be getting a new stove today.\",",
+ " \"Do you also have some ovens.\"]",
+ "}",
+ "",
+ "# see github repo for examples on sentence-transformers and Huggingface",
+ "nlp = spacy.load('en_core_web_md')",
+ "nlp.add_pipe(\"text_categorizer\", ",
+ " config={",
+ " \"data\": data,",
+ " \"model\": \"spacy\"",
+ " }",
+ ")",
+ "",
+ "print(nlp(\"I am looking for kitchen appliances.\")._.cats)",
+ "# Output:",
+ "#",
+ "# [{\"label\": \"furniture\", \"score\": 0.21}, {\"label\": \"kitchen\", \"score\": 0.79}]"
+ ],
+ "author": "David Berenstein",
+ "author_links": {
+ "github": "davidberenstein1957",
+ "website": "https://www.linkedin.com/in/david-berenstein-1bab11105/"
+ },
+ "category": [
+ "pipeline",
+ "standalone"
+ ],
+ "tags": [
+ "classification",
+ "zero-shot",
+ "few-shot",
+ "sentence-transformers",
+ "huggingface"
+ ],
+ "spacy_version": 3
+ },
+ {
+ "id": "conciseconcepts",
+ "title": "Concise Concepts",
+ "slogan": "Concise Concepts uses few-shot NER based on word embedding similarity to get you going with ease!",
+ "description": "When wanting to apply NER to concise concepts, it is really easy to come up with examples, but it takes some effort to train an entire pipeline. Concise Concepts uses few-shot NER based on word embedding similarity to get you going with ease!",
+ "github": "pandora-intelligence/concise-concepts",
+ "pip": "concise-concepts",
+ "thumb": "https://raw.githubusercontent.com/Pandora-Intelligence/concise-concepts/master/img/logo.png",
+ "image": "https://raw.githubusercontent.com/Pandora-Intelligence/concise-concepts/master/img/example.png",
+ "code_example": [
+ "import spacy",
+ "from spacy import displacy",
+ "import concise_concepts",
+ "",
+ "data = {",
+ " \"fruit\": [\"apple\", \"pear\", \"orange\"],",
+ " \"vegetable\": [\"broccoli\", \"spinach\", \"tomato\"],",
+ " \"meat\": [\"beef\", \"pork\", \"fish\", \"lamb\"]",
+ "}",
+ "",
+ "text = \"\"\"",
+ " Heat the oil in a large pan and add the Onion, celery and carrots.",
+ " Then, cook over a mediumโlow heat for 10 minutes, or until softened.",
+ " Add the courgette, garlic, red peppers and oregano and cook for 2โ3 minutes.",
+ " Later, add some oranges and chickens.\"\"\"",
+ "",
+ "# use any model that has internal spacy embeddings",
+ "nlp = spacy.load('en_core_web_lg')",
+ "nlp.add_pipe(\"concise_concepts\", ",
+ " config={\"data\": data}",
+ ")",
+ "doc = nlp(text)",
+ "",
+ "options = {\"colors\": {\"fruit\": \"darkorange\", \"vegetable\": \"limegreen\", \"meat\": \"salmon\"},",
+ " \"ents\": [\"fruit\", \"vegetable\", \"meat\"]}",
+ "",
+ "displacy.render(doc, style=\"ent\", options=options)"
+ ],
+ "author": "David Berenstein",
+ "author_links": {
+ "github": "davidberenstein1957",
+ "website": "https://www.linkedin.com/in/david-berenstein-1bab11105/"
+ },
+ "category": [
+ "pipeline"
+ ],
+ "tags": [
+ "ner",
+ "few-shot",
+ "gensim"
+ ],
+ "spacy_version": 3
+ },
+ {
+ "id": "crosslingualcoreference",
+ "title": "Crosslingual Coreference",
+ "slogan": "One multi-lingual coreference model to rule them all!",
+ "description": "Coreference is amazing but the data required for training a model is very scarce. In our case, the available training for non-English languages also data proved to be poorly annotated. Crosslingual Coreference therefore uses the assumption a trained model with English data and cross-lingual embeddings should work for other languages with similar sentence structure. Verified to work quite well for at least (EN, NL, DK, FR, DE).",
+ "github": "pandora-intelligence/crosslingual-coreference",
+ "pip": "crosslingual-coreference",
+ "thumb": "https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/logo.png",
+ "image": "https://raw.githubusercontent.com/Pandora-Intelligence/crosslingual-coreference/master/img/example_total.png",
+ "code_example": [
+ "import spacy",
+ "import crosslingual_coreference",
+ "",
+ "text = \"\"\"",
+ " Do not forget about Momofuku Ando!",
+ " He created instant noodles in Osaka.",
+ " At that location, Nissin was founded.",
+ " Many students survived by eating these noodles, but they don't even know him.\"\"\"",
+ "",
+ "# use any model that has internal spacy embeddings",
+ "nlp = spacy.load('en_core_web_sm')",
+ "nlp.add_pipe(",
+ " \"xx_coref\", config={\"chunk_size\": 2500, \"chunk_overlap\": 2, \"device\": 0})",
+ ")",
+ "",
+ "doc = nlp(text)",
+ "",
+ "print(doc._.coref_clusters)",
+ "# Output",
+ "#",
+ "# [[[4, 5], [7, 7], [27, 27], [36, 36]],",
+ "# [[12, 12], [15, 16]],",
+ "# [[9, 10], [27, 28]],",
+ "# [[22, 23], [31, 31]]]",
+ "print(doc._.resolved_text)",
+ "# Output",
+ "#",
+ "# Do not forget about Momofuku Ando!",
+ "# Momofuku Ando created instant noodles in Osaka.",
+ "# At Osaka, Nissin was founded.",
+ "# Many students survived by eating instant noodles,",
+ "# but Many students don't even know Momofuku Ando."
+ ],
+ "author": "David Berenstein",
+ "author_links": {
+ "github": "davidberenstein1957",
+ "website": "https://www.linkedin.com/in/david-berenstein-1bab11105/"
+ },
+ "category": [
+ "pipeline",
+ "standalone"
+ ],
+ "tags": [
+ "coreference",
+ "multi-lingual",
+ "cross-lingual",
+ "allennlp"
+ ],
+ "spacy_version": 3
+ },
{
"id": "blackstone",
"title": "Blackstone",
@@ -2622,9 +2817,9 @@
"id": "coreferee",
"title": "Coreferee",
"slogan": "Coreference resolution for multiple languages",
- "github": "msg-systems/coreferee",
- "url": "https://github.com/msg-systems/coreferee",
- "description": "Coreferee is a pipeline plugin that performs coreference resolution for English, German and Polish. It is designed so that it is easy to add support for new languages and optimised for limited training data. It uses a mixture of neural networks and programmed rules. Please note you will need to [install models](https://github.com/msg-systems/coreferee#getting-started) before running the code example.",
+ "github": "explosion/coreferee",
+ "url": "https://github.com/explosion/coreferee",
+ "description": "Coreferee is a pipeline plugin that performs coreference resolution for English, French, German and Polish. It is designed so that it is easy to add support for new languages and optimised for limited training data. It uses a mixture of neural networks and programmed rules. Please note you will need to [install models](https://github.com/explosion/coreferee#getting-started) before running the code example.",
"pip": "coreferee",
"category": ["pipeline", "models", "standalone"],
"tags": ["coreference-resolution", "anaphora"],
@@ -2982,18 +3177,25 @@
"import spacy",
"import pytextrank",
"",
- "nlp = spacy.load('en_core_web_sm')",
+ "# example text",
+ "text = \"\"\"Compatibility of systems of linear constraints over the set of natural numbers.",
+ "Criteria of compatibility of a system of linear Diophantine equations, strict inequations,",
+ "and nonstrict inequations are considered. Upper bounds for components of a minimal set of",
+ "solutions and algorithms of construction of minimal generating sets of solutions for all types",
+ "of systems are given. These criteria and the corresponding algorithms for constructing a minimal",
+ "supporting set of solutions can be used in solving all the considered types systems and systems of mixed types.\"\"\"",
"",
- "tr = pytextrank.TextRank()",
- "nlp.add_pipe(tr.PipelineComponent, name='textrank', last=True)",
+ "# load a spaCy model, depending on language, scale, etc.",
+ "nlp = spacy.load(\"en_core_web_sm\")",
+ "# add PyTextRank to the spaCy pipeline",
+ "nlp.add_pipe(\"textrank\")",
"",
- "text = 'Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered.'",
"doc = nlp(text)",
- "",
"# examine the top-ranked phrases in the document",
- "for p in doc._.phrases:",
- " print('{:.4f} {:5d} {}'.format(p.rank, p.count, p.text))",
- " print(p.chunks)"
+ "for phrase in doc._.phrases:",
+ " print(phrase.text)",
+ " print(phrase.rank, phrase.count)",
+ " print(phrase.chunks)"
],
"code_language": "python",
"url": "https://github.com/DerwenAI/pytextrank/wiki",
@@ -3019,21 +3221,13 @@
"import spacy",
"from spacy_syllables import SpacySyllables",
"",
- "nlp = spacy.load('en_core_web_sm')",
- "syllables = SpacySyllables(nlp)",
- "nlp.add_pipe(syllables, after='tagger')",
+ "nlp = spacy.load(\"en_core_web_sm\")",
+ "nlp.add_pipe(\"syllables\", after=\"tagger\")",
"",
- "doc = nlp('terribly long')",
- "",
- "data = [",
- " (token.text, token._.syllables, token._.syllables_count)",
- " for token in doc",
- "]",
- "",
- "assert data == [",
- " ('terribly', ['ter', 'ri', 'bly'], 3),",
- " ('long', ['long'], 1)",
- "]"
+ "assert nlp.pipe_names == [\"tok2vec\", \"tagger\", \"syllables\", \"parser\", \"attribute_ruler\", \"lemmatizer\", \"ner\"]",
+ "doc = nlp(\"terribly long\")",
+ "data = [(token.text, token._.syllables, token._.syllables_count) for token in doc]",
+ "assert data == [(\"terribly\", [\"ter\", \"ri\", \"bly\"], 3), (\"long\", [\"long\"], 1)]"
],
"thumb": "https://raw.githubusercontent.com/sloev/spacy-syllables/master/logo.png",
"author": "Johannes Valbjรธrn",
@@ -3738,6 +3932,107 @@
},
"category": ["pipeline"],
"tags": ["pipeline", "nlp", "sentiment"]
+ },
+ {
+ "id": "textnets",
+ "slogan": "Text analysis with networks",
+ "description": "textnets represents collections of texts as networks of documents and words. This provides novel possibilities for the visualization and analysis of texts.",
+ "github": "jboynyc/textnets",
+ "image": "https://user-images.githubusercontent.com/2187261/152641425-6c0fb41c-b8e0-44fb-a52a-7c1ba24eba1e.png",
+ "code_example": [
+ "import textnets as tn",
+ "",
+ "corpus = tn.Corpus(tn.examples.moon_landing)",
+ "t = tn.Textnet(corpus.tokenized(), min_docs=1)",
+ "t.plot(label_nodes=True,",
+ " show_clusters=True,",
+ " scale_nodes_by=\"birank\",",
+ " scale_edges_by=\"weight\")"
+ ],
+ "author": "John Boy",
+ "author_links": {
+ "github": "jboynyc",
+ "twitter": "jboy"
+ },
+ "category": ["visualizers", "standalone"]
+ },
+ {
+ "id": "tmtoolkit",
+ "slogan": "Text mining and topic modeling toolkit",
+ "description": "tmtoolkit is a set of tools for text mining and topic modeling with Python developed especially for the use in the social sciences, in journalism or related disciplines. It aims for easy installation, extensive documentation and a clear programming interface while offering good performance on large datasets by the means of vectorized operations (via NumPy) and parallel computation (using Pythonโs multiprocessing module and the loky package).",
+ "github": "WZBSocialScienceCenter/tmtoolkit",
+ "code_example": [
+ "# Note: This requires these setup steps:",
+ "# pip install tmtoolkit[recommended]",
+ "# python -m tmtoolkit setup en",
+ "from tmtoolkit.corpus import Corpus, tokens_table, lemmatize, to_lowercase, dtm",
+ "from tmtoolkit.bow.bow_stats import tfidf, sorted_terms_table",
+ "# load built-in sample dataset and use 4 worker processes",
+ "corp = Corpus.from_builtin_corpus('en-News100', max_workers=4)",
+ "# investigate corpus as dataframe",
+ "toktbl = tokens_table(corp)",
+ "print(toktbl)",
+ "# apply some text normalization",
+ "lemmatize(corp)",
+ "to_lowercase(corp)",
+ "# build sparse document-token matrix (DTM)",
+ "# document labels identify rows, vocabulary tokens identify columns",
+ "mat, doc_labels, vocab = dtm(corp, return_doc_labels=True, return_vocab=True)",
+ "# apply tf-idf transformation to DTM",
+ "# operation is applied on sparse matrix and uses few memory",
+ "tfidf_mat = tfidf(mat)",
+ "# show top 5 tokens per document ranked by tf-idf",
+ "top_tokens = sorted_terms_table(tfidf_mat, vocab, doc_labels, top_n=5)",
+ "print(top_tokens)"
+ ],
+ "author": "Markus Konrad / WZB Social Science Center",
+ "author_links": {
+ "github": "internaut",
+ "twitter": "_knrd"
+ },
+ "category": ["scientific", "standalone"]
+ },
+ {
+ "id": "edsnlp",
+ "title": "EDS-NLP",
+ "slogan": "spaCy components to extract information from clinical notes written in French.",
+ "description": "EDS-NLP provides a set of rule-based spaCy components to extract information for French clinical notes. It also features _qualifier_ pipelines that detect negations, speculations and family context, among other modalities. Check out the [demo](https://aphp.github.io/edsnlp/demo/)!",
+ "github": "aphp/edsnlp",
+ "pip": "edsnlp",
+ "code_example": [
+ "import spacy",
+ "",
+ "nlp = spacy.blank(\"fr\")",
+ "",
+ "terms = dict(",
+ " covid=[\"covid\", \"coronavirus\"],",
+ ")",
+ "",
+ "# Sentencizer component, needed for negation detection",
+ "nlp.add_pipe(\"eds.sentences\")",
+ "# Matcher component",
+ "nlp.add_pipe(\"eds.matcher\", config=dict(terms=terms))",
+ "# Negation detection",
+ "nlp.add_pipe(\"eds.negation\")",
+ "",
+ "# Process your text in one call !",
+ "doc = nlp(\"Le patient est atteint de covid\")",
+ "",
+ "doc.ents",
+ "# Out: (covid,)",
+ "",
+ "doc.ents[0]._.negation",
+ "# Out: False"
+ ],
+ "code_language": "python",
+ "url": "https://aphp.github.io/edsnlp/",
+ "author": "AP-HP",
+ "author_links": {
+ "github": "aphp",
+ "website": "https://github.com/aphp"
+ },
+ "category": ["biomedical", "scientific", "research", "pipeline"],
+ "tags": ["clinical"]
}
],
diff --git a/website/setup/jinja_to_js.py b/website/setup/jinja_to_js.py
index e2eca7ffb..3e1963ff7 100644
--- a/website/setup/jinja_to_js.py
+++ b/website/setup/jinja_to_js.py
@@ -206,7 +206,6 @@ class JinjaToJS(object):
self.environment = Environment(
loader=FileSystemLoader(template_root),
autoescape=True,
- extensions=["jinja2.ext.with_", "jinja2.ext.autoescape"],
)
self.output = StringIO()
self.stored_names = set()
diff --git a/website/setup/requirements.txt b/website/setup/requirements.txt
index e7a8e65a7..cbd306cc3 100644
--- a/website/setup/requirements.txt
+++ b/website/setup/requirements.txt
@@ -1,3 +1,3 @@
# These are used to compile the training quickstart config
-jinja2
+jinja2>=3.1.0
srsly
diff --git a/website/src/components/list.js b/website/src/components/list.js
index e0a3d9b64..d31617487 100644
--- a/website/src/components/list.js
+++ b/website/src/components/list.js
@@ -6,11 +6,14 @@ import { replaceEmoji } from './icon'
export const Ol = props =>
export const Ul = props =>
)
diff --git a/website/src/styles/list.module.sass b/website/src/styles/list.module.sass
index 588b30ba0..1a352d9dd 100644
--- a/website/src/styles/list.module.sass
+++ b/website/src/styles/list.module.sass
@@ -36,6 +36,16 @@
box-sizing: content-box
vertical-align: top
+.emoji:before
+ content: attr(data-emoji)
+ padding-right: 0.75em
+ padding-top: 0
+ margin-left: -2.5em
+ width: 1.75em
+ text-align: right
+ font-size: 1em
+ position: static
+
.li-icon
text-indent: calc(-20px - 0.55em)
diff --git a/website/src/templates/index.js b/website/src/templates/index.js
index dfd59e424..bdbdbd431 100644
--- a/website/src/templates/index.js
+++ b/website/src/templates/index.js
@@ -120,8 +120,8 @@ const AlertSpace = ({ nightly, legacy }) => {
}
const navAlert = (
-
- 💥 Out now: spaCy v3.2
+
+ 💥 Out now: spaCy v3.3
)
diff --git a/website/src/widgets/landing.js b/website/src/widgets/landing.js
index 74607fd09..b7ae35f6e 100644
--- a/website/src/widgets/landing.js
+++ b/website/src/widgets/landing.js
@@ -15,9 +15,9 @@ import {
} from '../components/landing'
import { H2 } from '../components/typography'
import { InlineCode } from '../components/code'
+import { Ul, Li } from '../components/list'
import Button from '../components/button'
import Link from '../components/link'
-import { YouTube } from '../components/embed'
import QuickstartTraining from './quickstart-training'
import Project from './project'
@@ -25,6 +25,7 @@ import Features from './features'
import courseImage from '../../docs/images/course.jpg'
import prodigyImage from '../../docs/images/prodigy_overview.jpg'
import projectsImage from '../../docs/images/projects.png'
+import tailoredPipelinesImage from '../../docs/images/spacy-tailored-pipelines_wide.png'
import Benchmarks from 'usage/_benchmarks-models.md'
@@ -104,23 +105,45 @@ const Landing = ({ data }) => {
- spaCy v3.0 features all new transformer-based pipelines that
- bring spaCy's accuracy right up to the current state-of-the-art
- . You can use any pretrained transformer to train your own pipelines, and even
- share one transformer between multiple components with{' '}
- multi-task learning. Training is now fully configurable and
- extensible, and you can define your own custom models using{' '}
- PyTorch, TensorFlow and other frameworks. The
- new spaCy projects system lets you describe whole{' '}
- end-to-end workflows in a single file, giving you an easy path
- from prototype to production, and making it easy to clone and adapt
- best-practice projects for your own use cases.
+
+
+
+
+ Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's
+ core developers.
+
+
+
+
+
+ Streamlined. Nobody knows spaCy better than we do. Send
+ us your pipeline requirements and we'll be ready to start producing your
+ solution in no time at all.
+
+
+ Production ready. spaCy pipelines are robust and easy
+ to deploy. You'll get a complete spaCy project folder which is ready to{' '}
+ spacy project run.
+
+
+ Predictable. You'll know exactly what you're going to
+ get and what it's going to cost. We quote fees up-front, let you try
+ before you buy, and don't charge for over-runs at our end โ all the risk
+ is on us.
+
+
+ Maintainable. spaCy is an industry standard, and we'll
+ deliver your pipeline with full code, data, tests and documentation, so
+ your team can retrain, update and extend the solution as your
+ requirements change.
+
+
{
-
-
+
+ spaCy v3.0 features all new transformer-based pipelines that
+ bring spaCy's accuracy right up to the current state-of-the-art
+ . You can use any pretrained transformer to train your own pipelines, and even
+ share one transformer between multiple components with{' '}
+ multi-task learning. Training is now fully configurable and
+ extensible, and you can define your own custom models using{' '}
+ PyTorch, TensorFlow and other frameworks.
{
{nightly ? ` --branch ${DEFAULT_BRANCH}` : ''}
cd spaCy
-
- export PYTHONPATH=`pwd`
-
-
- set PYTHONPATH=C:\path\to\spaCy
- pip install -r requirements.txt
- python setup.py build_ext --inplace
- pip install {train || hardware == 'gpu' ? `'.[${pipExtras}]'` : '.'}
+ pip install --no-build-isolation --editable {train || hardware == 'gpu' ? `'.[${pipExtras}]'` : '.'}
# packages only available via pip
diff --git a/website/src/widgets/quickstart-training.js b/website/src/widgets/quickstart-training.js
index 2d3a0e679..fbeeaf79d 100644
--- a/website/src/widgets/quickstart-training.js
+++ b/website/src/widgets/quickstart-training.js
@@ -10,7 +10,7 @@ const DEFAULT_LANG = 'en'
const DEFAULT_HARDWARE = 'cpu'
const DEFAULT_OPT = 'efficiency'
const DEFAULT_TEXTCAT_EXCLUSIVE = true
-const COMPONENTS = ['tagger', 'morphologizer', 'parser', 'ner', 'textcat']
+const COMPONENTS = ['tagger', 'morphologizer', 'trainable_lemmatizer', 'parser', 'ner', 'spancat', 'textcat']
const COMMENT = `# This is an auto-generated partial config. To use it with 'spacy train'
# you can run spacy init fill-config to auto-fill all default settings:
# python -m spacy init fill-config ./base_config.cfg ./config.cfg`