Merge pull request #10777 from adrianeboyd/chore/update-develop-v3.4

Update develop for v3.4
Sofie Van Landeghem 2022-05-10 09:43:24 +02:00 committed by GitHub
commit 6d17168c4d
GPG Key ID: 4AEE18F83AFDEB23
181 changed files with 8169 additions and 1084 deletions


@ -4,6 +4,8 @@ about: Use this template if you came across a bug or unexpected behaviour differ
--- ---
<!-- NOTE: For questions or install related issues, please open a Discussion instead. -->
## How to reproduce the behaviour ## How to reproduce the behaviour
<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. --> <!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->


@ -1,8 +1,5 @@
blank_issues_enabled: false blank_issues_enabled: false
contact_links: contact_links:
- name: ⚠️ Python 3.10 Support
url: https://github.com/explosion/spaCy/discussions/9418
about: Python 3.10 wheels haven't been released yet, see the link for details.
- name: 🗯 Discussions Forum - name: 🗯 Discussions Forum
url: https://github.com/explosion/spaCy/discussions url: https://github.com/explosion/spaCy/discussions
about: Install issues, usage questions, general discussion and anything else that isn't a bug report. about: Install issues, usage questions, general discussion and anything else that isn't a bug report.

106
.github/contributors/fonfonx.md vendored Normal file

@ -0,0 +1,106 @@
# spaCy contributor agreement
This spaCy Contributor Agreement (**"SCA"**) is based on the
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
The SCA applies to any contribution that you make to any product or project
managed by us (the **"project"**), and sets out the intellectual property rights
you grant to us in the contributed materials. The term **"us"** shall mean
[ExplosionAI GmbH](https://explosion.ai/legal). The term
**"you"** shall mean the person or entity identified below.
If you agree to be bound by these terms, fill in the information requested
below and include the filled-in version with your first pull request, under the
folder [`.github/contributors/`](/.github/contributors/). The name of the file
should be your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
Read this agreement carefully before signing. These terms and conditions
constitute a binding legal agreement.
## Contributor Agreement
1. The term "contribution" or "contributed materials" means any source code,
object code, patch, tool, sample, graphic, specification, manual,
documentation, or any other material posted or submitted by you to the project.
2. With respect to any worldwide copyrights, or copyright applications and
registrations, in your contribution:
* you hereby assign to us joint ownership, and to the extent that such
assignment is or becomes invalid, ineffective or unenforceable, you hereby
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
royalty-free, unrestricted license to exercise all rights under those
copyrights. This includes, at our option, the right to sublicense these same
rights to third parties through multiple levels of sublicensees or other
licensing arrangements;
* you agree that each of us can do all things in relation to your
contribution as if each of us were the sole owners, and if one of us makes
a derivative work of your contribution, the one who makes the derivative
work (or has it made) will be the sole owner of that derivative work;
* you agree that you will not assert any moral rights in your contribution
against us, our licensees or transferees;
* you agree that we may register a copyright in your contribution and
exercise all ownership rights associated with it; and
* you agree that neither of us has any duty to consult with, obtain the
consent of, pay or render an accounting to the other for any use or
distribution of your contribution.
3. With respect to any patents you own, or that you can license without payment
to any third party, you hereby grant to us a perpetual, irrevocable,
non-exclusive, worldwide, no-charge, royalty-free license to:
* make, have made, use, sell, offer to sell, import, and otherwise transfer
your contribution in whole or in part, alone or in combination with or
included in any product, work or materials arising out of the project to
which your contribution was submitted, and
* at our option, to sublicense these same rights to third parties through
multiple levels of sublicensees or other licensing arrangements.
4. Except as set out above, you keep all right, title, and interest in your
contribution. The rights that you grant to us under these terms are effective
on the date you first submitted a contribution to us, even if your submission
took place before the date you sign these terms.
5. You covenant, represent, warrant and agree that:
* Each contribution that you submit is and shall be an original work of
authorship and you can legally grant the rights set out in this SCA;
* to the best of your knowledge, each contribution will not violate any
third party's copyrights, trademarks, patents, or other intellectual
property rights; and
* each contribution shall be in compliance with U.S. export control laws and
other applicable export and import laws. You agree to notify us if you
become aware of any circumstance which would make any of the foregoing
representations inaccurate in any respect. We may publicly disclose your
participation in the project, including the fact that you have signed the SCA.
6. This SCA is governed by the laws of the State of California and applicable
U.S. Federal law. Any choice of law rules will not apply.
7. Please place an “x” on one of the applicable statements below. Please do NOT
mark both statements:
* [x] I am signing on behalf of myself as an individual and no other person
or entity, including my employer, has or will have rights with respect to my
contributions.
* [ ] I am signing on behalf of my employer or a legal entity and I have the
actual authority to contractually bind that entity.
## Contributor Details
| Field | Entry |
|------------------------------- | -------------------- |
| Name | Xavier Fontaine |
| Company name (if applicable) | |
| Title or role (if applicable) | |
| Date | 2022-04-13 |
| GitHub username | fonfonx |
| Website (optional) | |

21
.github/workflows/gputests.yml vendored Normal file

@ -0,0 +1,21 @@
name: Weekly GPU tests
on:
schedule:
- cron: '0 1 * * MON'
jobs:
weekly-gputests:
strategy:
fail-fast: false
matrix:
branch: [master, v4]
runs-on: ubuntu-latest
steps:
- name: Trigger buildkite build
uses: buildkite/trigger-pipeline-action@v1.2.0
env:
PIPELINE: explosion-ai/spacy-slow-gpu-tests
BRANCH: ${{ matrix.branch }}
MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}

37
.github/workflows/slowtests.yml vendored Normal file

@ -0,0 +1,37 @@
name: Daily slow tests
on:
schedule:
- cron: '0 0 * * *'
jobs:
daily-slowtests:
strategy:
fail-fast: false
matrix:
branch: [master, v4]
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v1
with:
ref: ${{ matrix.branch }}
- name: Get commits from past 24 hours
id: check_commits
run: |
today=$(date '+%Y-%m-%d %H:%M:%S')
yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
if git log --after="$yesterday" --before="$today" | grep commit ; then
echo "::set-output name=run_tests::true"
else
echo "::set-output name=run_tests::false"
fi
- name: Trigger buildkite build
if: steps.check_commits.outputs.run_tests == 'true'
uses: buildkite/trigger-pipeline-action@v1.2.0
env:
PIPELINE: explosion-ai/spacy-slow-tests
BRANCH: ${{ matrix.branch }}
MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
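
The gating logic of the daily workflow above (only trigger the Buildkite build if the branch received commits in the past 24 hours) can be sketched in Python as follows. This is an illustration only, not part of the workflow: `should_run_tests` is a hypothetical helper, and treating both window boundaries as exclusive is an assumption about how `git log --after/--before` behaves at the exact boundary.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_run_tests(commit_times: Iterable[datetime], now: datetime) -> bool:
    """Return True if any commit falls inside the 24 hours before `now`."""
    cutoff = now - timedelta(days=1)
    # Boundaries are treated as exclusive here (an assumption of this sketch).
    return any(cutoff < commit_time < now for commit_time in commit_times)


now = datetime(2022, 5, 10, 0, 0, 0)
recent = [datetime(2022, 5, 9, 12, 0, 0)]
stale = [datetime(2022, 5, 7, 12, 0, 0)]
run_recent = should_run_tests(recent, now)
run_stale = should_run_tests(stale, now)
```

With no commits in the window the function returns `False`, which corresponds to the workflow skipping the "Trigger buildkite build" step.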

1
.gitignore vendored

@ -9,7 +9,6 @@ keys/
spacy/tests/package/setup.cfg spacy/tests/package/setup.cfg
spacy/tests/package/pyproject.toml spacy/tests/package/pyproject.toml
spacy/tests/package/requirements.txt spacy/tests/package/requirements.txt
spacy/tests/universe/universe.json
# Website # Website
website/.cache/ website/.cache/


@ -1,9 +1,10 @@
repos: repos:
- repo: https://github.com/ambv/black - repo: https://github.com/ambv/black
rev: 21.6b0 rev: 22.3.0
hooks: hooks:
- id: black - id: black
language_version: python3.7 language_version: python3.7
additional_dependencies: ['click==8.0.4']
- repo: https://gitlab.com/pycqa/flake8 - repo: https://gitlab.com/pycqa/flake8
rev: 3.9.2 rev: 3.9.2
hooks: hooks:


@ -233,7 +233,7 @@ also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black` (i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings. (see above), you shouldn't see any formatting-related warnings.
The [`.flake8`](.flake8) config defines the configuration we use for this The `flake8` section in [`setup.cfg`](setup.cfg) defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables. excluding very large files like lemmatization and tokenizer exception tables.


@ -33,7 +33,7 @@ open-source software, released under the MIT license.
## 📖 Documentation ## 📖 Documentation
| Documentation | | | Documentation | |
| -------------------------- | -------------------------------------------------------------- | | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! | | ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
| 📚 **[Usage Guides]** | How to use spaCy and its features. | | 📚 **[Usage Guides]** | How to use spaCy and its features. |
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. | | 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
@ -45,6 +45,7 @@ open-source software, released under the MIT license.
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. | | 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
| 🛠 **[Changelog]** | Changes and version history. | | 🛠 **[Changelog]** | Changes and version history. |
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. | | 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more &rarr;](https://explosion.ai/spacy-tailored-pipelines)** |
[spacy 101]: https://spacy.io/usage/spacy-101 [spacy 101]: https://spacy.io/usage/spacy-101
[new in v3.0]: https://spacy.io/usage/v3 [new in v3.0]: https://spacy.io/usage/v3
@ -60,9 +61,7 @@ open-source software, released under the MIT license.
## 💬 Where to ask questions ## 💬 Where to ask questions
The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**, The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
**[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
**[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
Please understand that we won't be able to provide individual support via email. Please understand that we won't be able to provide individual support via email.
We also believe that help is much more valuable if it's shared publicly, so that We also believe that help is much more valuable if it's shared publicly, so that
more people can benefit from it. more people can benefit from it.


@ -11,12 +11,14 @@ trigger:
exclude: exclude:
- "website/*" - "website/*"
- "*.md" - "*.md"
- ".github/workflows/*"
pr: pr:
paths: paths:
exclude: exclude:
- "*.md" - "*.md"
- "website/docs/*" - "website/docs/*"
- "website/src/*" - "website/src/*"
- ".github/workflows/*"
jobs: jobs:
# Perform basic checks for most important errors (syntax etc.) Uses the config # Perform basic checks for most important errors (syntax etc.) Uses the config


@ -137,7 +137,7 @@ If any of the TODOs you've added are important and should be fixed soon, you sho
## Type hints ## Type hints
We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation. We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation. Ideally when developing, run `mypy spacy` on the code base to inspect any issues.
If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable` although, you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return values. If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable` although, you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return values.
@ -155,6 +155,13 @@ def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
return callback return callback
``` ```
For typing variables, we prefer the explicit format.
```diff
- var = value # type: Type
+ var: Type = value
```
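
As a minimal sketch of the explicit format in practice (the variable names below are made up for illustration and are not taken from the spaCy code base):

```python
from typing import Dict, List, Optional

# Inline annotations are used instead of trailing "# type:" comments.
threshold: int = 1
labels: List[str] = ["PERSON", "ORG"]
label_counts: Dict[str, int] = {label: 0 for label in labels}
first_frequent_label: Optional[str] = None

for label in labels:
    label_counts[label] += 1
    # Remember the first label that reaches the threshold.
    if first_frequent_label is None and label_counts[label] >= threshold:
        first_frequent_label = label
```

Because the annotation is part of the statement itself, `mypy` and editors pick it up without having to parse comments.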
For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type). For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).
```python ```python


@ -0,0 +1,36 @@
# Explosion-bot
Explosion-bot is a robot that can be invoked to help with running particular test commands.
## Permissions
Only maintainers have permission to summon explosion-bot. Each of the open-source repos that uses explosion-bot has its own team(s) of maintainers, and only GitHub users who are members of those teams can successfully run bot commands.
## Running robot commands
To summon the robot, write a GitHub comment on the issue/PR you wish to test. The comment must be in the following format:
```
@explosion-bot please test_gpu
```
Some things to note:
* The `@explosion-bot please` must be the beginning of the command - you cannot add anything in front of this or else the robot won't know how to parse it. Adding anything at the end aside from the test name will also confuse the robot, so keep it simple!
* The command name (such as `test_gpu`) must be one of the tests that the bot knows how to run. The available commands are documented in the bot's [workflow config](https://github.com/explosion/spaCy/blob/master/.github/workflows/explosionbot.yml#L26) and must match exactly one of the commands listed there.
* The robot can't do multiple things at once, so if you want it to run multiple tests, you'll have to summon it with one comment per test.
* For the `test_gpu` command, you can specify an optional thinc branch (from the spaCy repo) or a spaCy branch (from the thinc repo) with either the `--thinc-branch` or `--spacy-branch` flags. By default, the bot will pull in the PR branch from the repo where the command was issued, and the main branch of the other repository. However, if you need to run against another branch, you can say (for example):
```
@explosion-bot please test_gpu --thinc-branch develop
```
You can also specify a branch from an unmerged PR:
```
@explosion-bot please test_gpu --thinc-branch refs/pull/633/head
```
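
The format rules above can be illustrated with a small parser. This is a hypothetical sketch and not explosion-bot's actual implementation: the function `parse_bot_command`, its return shape, and the assumption that every flag takes exactly one value are made up for illustration.

```python
from typing import Dict, Optional, Tuple

PREFIX = "@explosion-bot please"


def parse_bot_command(comment: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Return (command, flags) if the comment follows the documented format, else None."""
    # The prefix must be the very beginning of the comment.
    if not comment.startswith(PREFIX):
        return None
    remainder = comment[len(PREFIX):]
    # Require whitespace between the prefix and the command name.
    if remainder and not remainder[0].isspace():
        return None
    tokens = remainder.split()
    if not tokens:
        return None
    command, rest = tokens[0], tokens[1:]
    # Flags are read as "--name value" pairs, e.g. "--thinc-branch develop".
    if len(rest) % 2 != 0:
        return None
    flags: Dict[str, str] = {}
    for name, value in zip(rest[0::2], rest[1::2]):
        if not name.startswith("--"):
            return None
        flags[name[2:]] = value
    return command, flags


plain = parse_bot_command("@explosion-bot please test_gpu")
with_branch = parse_bot_command(
    "@explosion-bot please test_gpu --thinc-branch refs/pull/633/head"
)
rejected = parse_bot_command("Hey @explosion-bot please test_gpu")
```

The last call returns `None` because text precedes the prefix, which mirrors the first note in the list above.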
## Troubleshooting
If the robot isn't responding to commands as expected, you can check its logs in the [GitHub Action](https://github.com/explosion/spaCy/actions/workflows/explosionbot.yml).
For each command sent to the bot, there should be a run of the `explosion-bot` workflow. In the `Install and run explosion-bot` step, towards the end of the logs, you should see info about the configuration that the bot was run with, as well as any errors that the bot encountered.


@ -5,7 +5,7 @@ requires = [
"cymem>=2.0.2,<2.1.0", "cymem>=2.0.2,<2.1.0",
"preshed>=3.0.2,<3.1.0", "preshed>=3.0.2,<3.1.0",
"murmurhash>=0.28.0,<1.1.0", "murmurhash>=0.28.0,<1.1.0",
"thinc>=8.0.12,<8.1.0", "thinc>=8.0.14,<8.1.0",
"blis>=0.4.0,<0.8.0", "blis>=0.4.0,<0.8.0",
"pathy", "pathy",
"numpy>=1.15.0", "numpy>=1.15.0",


@ -1,14 +1,14 @@
# Our libraries # Our libraries
spacy-legacy>=3.0.8,<3.1.0 spacy-legacy>=3.0.9,<3.1.0
spacy-loggers>=1.0.0,<2.0.0 spacy-loggers>=1.0.0,<2.0.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.0.12,<8.1.0 thinc>=8.0.14,<8.1.0
blis>=0.4.0,<0.8.0 blis>=0.4.0,<0.8.0
ml_datasets>=0.2.0,<0.3.0 ml_datasets>=0.2.0,<0.3.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
wasabi>=0.8.1,<1.1.0 wasabi>=0.9.1,<1.1.0
srsly>=2.4.1,<3.0.0 srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0 catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0 typer>=0.3.0,<0.5.0
pathy>=0.3.5 pathy>=0.3.5
@ -26,7 +26,7 @@ typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
# Development dependencies # Development dependencies
pre-commit>=2.13.0 pre-commit>=2.13.0
cython>=0.25,<3.0 cython>=0.25,<3.0
pytest>=5.2.0 pytest>=5.2.0,!=7.1.0
pytest-timeout>=1.3.0,<2.0.0 pytest-timeout>=1.3.0,<2.0.0
mock>=2.0.0,<3.0.0 mock>=2.0.0,<3.0.0
flake8>=3.8.0,<3.10.0 flake8>=3.8.0,<3.10.0
@ -35,3 +35,4 @@ mypy==0.910
types-dataclasses>=0.1.3; python_version < "3.7" types-dataclasses>=0.1.3; python_version < "3.7"
types-mock>=0.1.1 types-mock>=0.1.1
types-requests types-requests
black>=22.0,<23.0


@ -38,18 +38,18 @@ setup_requires =
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
thinc>=8.0.12,<8.1.0 thinc>=8.0.14,<8.1.0
install_requires = install_requires =
# Our libraries # Our libraries
spacy-legacy>=3.0.8,<3.1.0 spacy-legacy>=3.0.9,<3.1.0
spacy-loggers>=1.0.0,<2.0.0 spacy-loggers>=1.0.0,<2.0.0
murmurhash>=0.28.0,<1.1.0 murmurhash>=0.28.0,<1.1.0
cymem>=2.0.2,<2.1.0 cymem>=2.0.2,<2.1.0
preshed>=3.0.2,<3.1.0 preshed>=3.0.2,<3.1.0
thinc>=8.0.12,<8.1.0 thinc>=8.0.14,<8.1.0
blis>=0.4.0,<0.8.0 blis>=0.4.0,<0.8.0
wasabi>=0.8.1,<1.1.0 wasabi>=0.9.1,<1.1.0
srsly>=2.4.1,<3.0.0 srsly>=2.4.3,<3.0.0
catalogue>=2.0.6,<2.1.0 catalogue>=2.0.6,<2.1.0
typer>=0.3.0,<0.5.0 typer>=0.3.0,<0.5.0
pathy>=0.3.5 pathy>=0.3.5


@ -23,6 +23,7 @@ Options.docstrings = True
PACKAGES = find_packages() PACKAGES = find_packages()
MOD_NAMES = [ MOD_NAMES = [
"spacy.training.alignment_array",
"spacy.training.example", "spacy.training.example",
"spacy.parts_of_speech", "spacy.parts_of_speech",
"spacy.strings", "spacy.strings",
@ -33,6 +34,7 @@ MOD_NAMES = [
"spacy.ml.parser_model", "spacy.ml.parser_model",
"spacy.morphology", "spacy.morphology",
"spacy.pipeline.dep_parser", "spacy.pipeline.dep_parser",
"spacy.pipeline._edit_tree_internals.edit_trees",
"spacy.pipeline.morphologizer", "spacy.pipeline.morphologizer",
"spacy.pipeline.multitask", "spacy.pipeline.multitask",
"spacy.pipeline.ner", "spacy.pipeline.ner",
@ -81,7 +83,6 @@ COPY_FILES = {
ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package", ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package", ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package", ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
} }


@ -1,6 +1,6 @@
# fmt: off # fmt: off
__title__ = "spacy" __title__ = "spacy"
__version__ = "3.2.1" __version__ = "3.3.0"
__download_url__ = "https://github.com/explosion/spacy-models/releases/download" __download_url__ = "https://github.com/explosion/spacy-models/releases/download"
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
__projects__ = "https://github.com/explosion/projects" __projects__ = "https://github.com/explosion/projects"


@ -14,6 +14,7 @@ from .pretrain import pretrain # noqa: F401
from .debug_data import debug_data # noqa: F401 from .debug_data import debug_data # noqa: F401
from .debug_config import debug_config # noqa: F401 from .debug_config import debug_config # noqa: F401
from .debug_model import debug_model # noqa: F401 from .debug_model import debug_model # noqa: F401
from .debug_diff import debug_diff # noqa: F401
from .evaluate import evaluate # noqa: F401 from .evaluate import evaluate # noqa: F401
from .convert import convert # noqa: F401 from .convert import convert # noqa: F401
from .init_pipeline import init_pipeline_cli # noqa: F401 from .init_pipeline import init_pipeline_cli # noqa: F401


@ -360,7 +360,7 @@ def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False)
src = str(src) src = str(src)
with smart_open.open(src, mode="rb", ignore_ext=True) as input_file: with smart_open.open(src, mode="rb", ignore_ext=True) as input_file:
with dest.open(mode="wb") as output_file: with dest.open(mode="wb") as output_file:
output_file.write(input_file.read()) shutil.copyfileobj(input_file, output_file)
def ensure_pathy(path): def ensure_pathy(path):
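
The hunk above swaps a single `read()` of the whole source for `shutil.copyfileobj`, which copies in chunks. A minimal sketch of that behaviour, using in-memory streams as stand-ins for the remote and local files (the payload and the buffer size are made up for illustration):

```python
import io
import shutil

payload = b"x" * (1024 * 1024)  # stand-in for a large remote file
input_file = io.BytesIO(payload)
output_file = io.BytesIO()

# copyfileobj reads and writes fixed-size chunks, so peak memory use is
# bounded by the buffer length rather than by the size of the file.
shutil.copyfileobj(input_file, output_file, length=64 * 1024)

copied = output_file.getvalue()
```

The copied bytes are identical to the source; only the memory profile of the transfer differs from `output_file.write(input_file.read())`.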


@ -19,6 +19,7 @@ from ..morphology import Morphology
from ..language import Language from ..language import Language
from ..util import registry, resolve_dot_names from ..util import registry, resolve_dot_names
from ..compat import Literal from ..compat import Literal
from ..vectors import Mode as VectorsMode
from .. import util from .. import util
@ -170,6 +171,14 @@ def debug_data(
show=verbose, show=verbose,
) )
if len(nlp.vocab.vectors): if len(nlp.vocab.vectors):
if nlp.vocab.vectors.mode == VectorsMode.floret:
msg.info(
f"floret vectors with {len(nlp.vocab.vectors)} vectors, "
f"{nlp.vocab.vectors_length} dimensions, "
f"{nlp.vocab.vectors.minn}-{nlp.vocab.vectors.maxn} char "
f"n-gram subwords"
)
else:
msg.info( msg.info(
f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} " f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
f"unique keys, {nlp.vocab.vectors_length} dimensions)" f"unique keys, {nlp.vocab.vectors_length} dimensions)"
@ -193,6 +202,70 @@ def debug_data(
else: else:
msg.info("No word vectors present in the package") msg.info("No word vectors present in the package")
if "spancat" in factory_names:
model_labels_spancat = _get_labels_from_spancat(nlp)
has_low_data_warning = False
has_no_neg_warning = False
msg.divider("Span Categorization")
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
msg.text("Label counts in train data: ", show=verbose)
for spans_key, data_labels in gold_train_data["spancat"].items():
msg.text(
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
show=verbose,
)
# Data checks: only take the spans keys in the actual spancat components
data_labels_in_component = {
spans_key: gold_train_data["spancat"][spans_key]
for spans_key in model_labels_spancat.keys()
}
for spans_key, data_labels in data_labels_in_component.items():
for label, count in data_labels.items():
# Check for missing labels
spans_key_in_model = spans_key in model_labels_spancat.keys()
if (spans_key_in_model) and (
label not in model_labels_spancat[spans_key]
):
msg.warn(
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
"Performance may degrade after training."
)
# Check for low number of examples per label
if count <= NEW_LABEL_THRESHOLD:
msg.warn(
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
)
has_low_data_warning = True
# Check for negative examples
with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(
train_dataset, label, "spancat", spans_key
)
if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True
if has_low_data_warning:
msg.text(
f"To train a new span type, your data should include at "
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
show=verbose,
)
else:
msg.good("Good amount of examples for all labels")
if has_no_neg_warning:
msg.text(
"Training data should always include examples of spans "
"in context, as well as examples without a given span "
"type.",
show=verbose,
)
else:
msg.good("Examples without occurrences available for all labels")
if "ner" in factory_names: if "ner" in factory_names:
# Get all unique NER labels present in the data # Get all unique NER labels present in the data
labels = set( labels = set(
@ -238,7 +311,7 @@ def debug_data(
has_low_data_warning = True has_low_data_warning = True
with msg.loading("Analyzing label distribution..."): with msg.loading("Analyzing label distribution..."):
neg_docs = _get_examples_without_label(train_dataset, label) neg_docs = _get_examples_without_label(train_dataset, label, "ner")
if neg_docs == 0: if neg_docs == 0:
msg.warn(f"No examples for texts WITHOUT new label '{label}'") msg.warn(f"No examples for texts WITHOUT new label '{label}'")
has_no_neg_warning = True has_no_neg_warning = True
@ -573,6 +646,7 @@ def _compile_gold(
"deps": Counter(), "deps": Counter(),
"words": Counter(), "words": Counter(),
"roots": Counter(), "roots": Counter(),
"spancat": dict(),
"ws_ents": 0, "ws_ents": 0,
"boundary_cross_ents": 0, "boundary_cross_ents": 0,
"n_words": 0, "n_words": 0,
@ -603,6 +677,7 @@ def _compile_gold(
if nlp.vocab.strings[word] not in nlp.vocab.vectors: if nlp.vocab.strings[word] not in nlp.vocab.vectors:
data["words_missing_vectors"].update([word]) data["words_missing_vectors"].update([word])
if "ner" in factory_names: if "ner" in factory_names:
sent_starts = eg.get_aligned_sent_starts()
for i, label in enumerate(eg.get_aligned_ner()): for i, label in enumerate(eg.get_aligned_ner()):
if label is None: if label is None:
continue continue
@ -612,10 +687,19 @@ def _compile_gold(
if label.startswith(("B-", "U-")): if label.startswith(("B-", "U-")):
combined_label = label.split("-")[1] combined_label = label.split("-")[1]
data["ner"][combined_label] += 1 data["ner"][combined_label] += 1
if gold[i].is_sent_start and label.startswith(("I-", "L-")): if sent_starts[i] == True and label.startswith(("I-", "L-")):
data["boundary_cross_ents"] += 1 data["boundary_cross_ents"] += 1
elif label == "-": elif label == "-":
data["ner"]["-"] += 1 data["ner"]["-"] += 1
if "spancat" in factory_names:
for span_key in list(eg.reference.spans.keys()):
if span_key not in data["spancat"]:
data["spancat"][span_key] = Counter()
for i, span in enumerate(eg.reference.spans[span_key]):
if span.label_ is None:
continue
else:
data["spancat"][span_key][span.label_] += 1
if "textcat" in factory_names or "textcat_multilabel" in factory_names: if "textcat" in factory_names or "textcat_multilabel" in factory_names:
data["cats"].update(gold.cats) data["cats"].update(gold.cats)
if any(val not in (0, 1) for val in gold.cats.values()): if any(val not in (0, 1) for val in gold.cats.values()):
@ -686,14 +770,28 @@ def _format_labels(
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)]) return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
def _get_examples_without_label(data: Sequence[Example], label: str) -> int: def _get_examples_without_label(
data: Sequence[Example],
label: str,
component: Literal["ner", "spancat"] = "ner",
spans_key: Optional[str] = "sc",
) -> int:
count = 0 count = 0
for eg in data: for eg in data:
if component == "ner":
labels = [ labels = [
label.split("-")[1] label.split("-")[1]
for label in eg.get_aligned_ner() for label in eg.get_aligned_ner()
if label not in ("O", "-", None) if label not in ("O", "-", None)
] ]
if component == "spancat":
labels = (
[span.label_ for span in eg.reference.spans[spans_key]]
if spans_key in eg.reference.spans
else []
)
if label not in labels: if label not in labels:
count += 1 count += 1
return count return count
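
The negative-example check for `spancat` can be sketched without spaCy objects. In the sketch below, each example is a plain dict mapping a spans key to its span labels; this stand-in, the `ToyExample` alias and the function name are assumptions made for illustration, not the `Example` API.

```python
from typing import Dict, List, Sequence

# Hypothetical stand-in for an Example: spans key -> labels of its spans.
ToyExample = Dict[str, List[str]]


def count_examples_without_label(
    data: Sequence[ToyExample], label: str, spans_key: str = "sc"
) -> int:
    """Count examples in which `label` never occurs under `spans_key`."""
    count = 0
    for eg in data:
        # A missing spans key is treated like an empty list of spans.
        labels = eg.get(spans_key, [])
        if label not in labels:
            count += 1
    return count


train_data: List[ToyExample] = [
    {"sc": ["PERSON", "ORG"]},
    {"sc": ["ORG"]},
    {"other": ["PERSON"]},
]
n_without_person = count_examples_without_label(train_data, "PERSON")
n_without_org = count_examples_without_label(train_data, "ORG")
```

A count of zero for some label is what triggers the "No examples for texts WITHOUT new label" warning in `debug_data`.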

89
spacy/cli/debug_diff.py Normal file

@ -0,0 +1,89 @@
from typing import Optional
import typer
from wasabi import Printer, diff_strings, MarkdownRenderer
from pathlib import Path
from thinc.api import Config
from ._util import debug_cli, Arg, Opt, show_validation_error, parse_config_overrides
from ..util import load_config
from .init_config import init_config, Optimizations
@debug_cli.command(
"diff-config",
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
)
def debug_diff_cli(
# fmt: off
ctx: typer.Context,
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
compare_to: Optional[Path] = Opt(None, help="Path to a config file to diff against, or `None` to compare against default settings", exists=True, allow_dash=True),
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether the user config was optimized for efficiency or accuracy. Only relevant when comparing against the default config."),
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the original config can run on a GPU. Only relevant when comparing against the default config."),
pretraining: bool = Opt(False, "--pretraining", "--pt", help="Whether to compare on a config with pretraining involved. Only relevant when comparing against the default config."),
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues")
# fmt: on
):
"""Show a diff of a config file with respect to spaCy's defaults or another config file. If
additional settings were used in the creation of the config file, then you
must supply these as extra parameters to the command when comparing to the default settings. The generated diff
can also be used when posting to the discussion forum to provide more
information for the maintainers.
The `optimize`, `gpu`, and `pretraining` options are only relevant when
comparing against the default configuration (that is, when `compare_to` is None).
DOCS: https://spacy.io/api/cli#debug-diff
"""
debug_diff(
config_path=config_path,
compare_to=compare_to,
gpu=gpu,
optimize=optimize,
pretraining=pretraining,
markdown=markdown,
)
def debug_diff(
config_path: Path,
compare_to: Optional[Path],
gpu: bool,
optimize: Optimizations,
pretraining: bool,
markdown: bool,
):
msg = Printer()
with show_validation_error(hint_fill=False):
user_config = load_config(config_path)
if compare_to:
other_config = load_config(compare_to)
else:
# Recreate a default config based on the user's config
lang = user_config["nlp"]["lang"]
pipeline = list(user_config["nlp"]["pipeline"])
msg.info(f"Found user-defined language: '{lang}'")
msg.info(f"Found user-defined pipelines: {pipeline}")
other_config = init_config(
lang=lang,
pipeline=pipeline,
optimize=optimize.value,
gpu=gpu,
pretraining=pretraining,
silent=True,
)
user = user_config.to_str()
other = other_config.to_str()
if user == other:
msg.warn("No diff to show: configs are identical")
else:
diff_text = diff_strings(other, user, add_symbols=markdown)
if markdown:
md = MarkdownRenderer()
md.add(md.code_block(diff_text, "diff"))
print(md.text)
else:
print(diff_text)
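
The compare-then-diff flow of `debug_diff` can be tried out without spaCy installed. The sketch below uses the standard library's `difflib.unified_diff` as a stand-in for wasabi's `diff_strings`, so the output format will differ from the real command; `diff_configs` and the two config strings are hypothetical.

```python
import difflib


def diff_configs(other: str, user: str) -> str:
    """Return a unified diff from `other` (the baseline) to `user`.

    An empty string means the two configs are identical.
    """
    if user == other:
        return ""
    lines = difflib.unified_diff(
        other.splitlines(),
        user.splitlines(),
        fromfile="default",
        tofile="user",
        lineterm="",
    )
    return "\n".join(lines)


default_cfg = '[nlp]\nlang = "en"\nbatch_size = 1000\n'
user_cfg = '[nlp]\nlang = "en"\nbatch_size = 128\n'
print(diff_configs(default_cfg, user_cfg))
```

As in the command above, the baseline goes first and the user's config second, so lines the user changed show up as additions.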

View File

@ -7,6 +7,7 @@ from collections import defaultdict
from catalogue import RegistryError from catalogue import RegistryError
import srsly import srsly
import sys import sys
import re
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
from ..schemas import validate, ModelMetaSchema from ..schemas import validate, ModelMetaSchema
@ -109,6 +110,24 @@ def package(
", ".join(meta["requirements"]), ", ".join(meta["requirements"]),
) )
if name is not None: if name is not None:
if not name.isidentifier():
msg.fail(
f"Model name ('{name}') is not a valid module name. "
"This is required so it can be imported as a module.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
exits=1,
)
if not _is_permitted_package_name(name):
msg.fail(
f"Model name ('{name}') is not a permitted package name. "
"This is required to correctly load the model with spacy.load.",
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
"and 0-9. "
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
exits=1,
)
meta["name"] = name meta["name"] = name
if version is not None: if version is not None:
meta["version"] = version meta["version"] = version
@ -162,7 +181,7 @@ def package(
imports="\n".join(f"from . import {m}" for m in imports) imports="\n".join(f"from . import {m}" for m in imports)
) )
create_file(package_path / "__init__.py", init_py) create_file(package_path / "__init__.py", init_py)
msg.good(f"Successfully created package '{model_name_v}'", main_path) msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
if create_sdist: if create_sdist:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "sdist"], capture=False) util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
@ -171,8 +190,14 @@ def package(
if create_wheel: if create_wheel:
with util.working_dir(main_path): with util.working_dir(main_path):
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False) util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}" wheel_name_squashed = re.sub("_+", "_", model_name_v)
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
msg.good(f"Successfully created binary wheel", wheel) msg.good(f"Successfully created binary wheel", wheel)
if "__" in model_name:
msg.warn(
f"Model name ('{model_name}') contains a run of underscores. "
"Runs of underscores are not significant in installed package names.",
)
def has_wheel() -> bool: def has_wheel() -> bool:
@ -422,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
return md.text return md.text
def _is_permitted_package_name(package_name: str) -> bool:
# regex from: https://www.python.org/dev/peps/pep-0426/#name
permitted_match = re.search(
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
)
return permitted_match is not None
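
The two name checks and the underscore squashing added in this file can be exercised in isolation. This is a small sketch that reuses the regular expression from the diff; the helper names without the leading underscore and the sample names are illustrative only.

```python
import re


def is_permitted_package_name(package_name: str) -> bool:
    # Same pattern as in the diff (taken from PEP 426).
    permitted_match = re.search(
        r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
    )
    return permitted_match is not None


def squash_underscores(name: str) -> str:
    # Runs of underscores collapse to one, as done for the wheel file name.
    return re.sub("_+", "_", name)


for name in ["my_model", "_private", "my-model", "my__model"]:
    print(
        name,
        name.isidentifier(),
        is_permitted_package_name(name),
        squash_underscores(name),
    )
```

The checks are complementary: `_private` is a valid identifier but not a permitted package name, while `my-model` is a permitted package name but not a valid identifier, which is why `package` applies both.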
TEMPLATE_SETUP = """ TEMPLATE_SETUP = """
#!/usr/bin/env python #!/usr/bin/env python
import io import io

View File

@ -3,6 +3,7 @@ the docs and the init config command. It encodes various best practices and
can help generate the best possible configuration, given a user's requirements. #} can help generate the best possible configuration, given a user's requirements. #}
{%- set use_transformer = hardware != "cpu" -%} {%- set use_transformer = hardware != "cpu" -%}
{%- set transformer = transformer_data[optimize] if use_transformer else {} -%} {%- set transformer = transformer_data[optimize] if use_transformer else {} -%}
{%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "spancat", "trainable_lemmatizer"] -%}
[paths] [paths]
train = null train = null
dev = null dev = null
@ -24,10 +25,10 @@ lang = "{{ lang }}"
{%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%} {%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%}
{%- set with_accuracy = optimize == "accuracy" -%} {%- set with_accuracy = optimize == "accuracy" -%}
{%- set has_accurate_textcat = has_textcat and with_accuracy -%} {%- set has_accurate_textcat = has_textcat and with_accuracy -%}
{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or has_accurate_textcat) -%} {%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "spancat" in components or "trainable_lemmatizer" in components or "entity_linker" in components or has_accurate_textcat) -%}
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %} {%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components -%}
{%- else -%} {%- else -%}
{%- set full_pipeline = components %} {%- set full_pipeline = components -%}
{%- endif %} {%- endif %}
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }} pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
batch_size = {{ 128 if hardware == "gpu" else 1000 }} batch_size = {{ 128 if hardware == "gpu" else 1000 }}
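
The template condition above decides whether an embedding component is prepended to the pipeline. Below is a rough Python rendering of that Jinja logic, assuming the same variable meanings; `build_full_pipeline` is a hypothetical name and not a function in spaCy.

```python
from typing import List

# Components that trigger a shared embedding layer in the condition above.
EMBEDDING_USERS = (
    "tagger",
    "morphologizer",
    "parser",
    "ner",
    "spancat",
    "trainable_lemmatizer",
    "entity_linker",
)


def build_full_pipeline(
    components: List[str], hardware: str, optimize: str
) -> List[str]:
    use_transformer = hardware != "cpu"
    has_textcat = "textcat" in components or "textcat_multilabel" in components
    has_accurate_textcat = has_textcat and optimize == "accuracy"
    needs_embedding = has_accurate_textcat or any(
        name in components for name in EMBEDDING_USERS
    )
    if needs_embedding:
        return ["transformer" if use_transformer else "tok2vec"] + components
    return components


print(build_full_pipeline(["spancat"], "cpu", "efficiency"))
print(build_full_pipeline(["textcat"], "cpu", "efficiency"))
```

With this change, a pipeline containing only `spancat` or `trainable_lemmatizer` also gets a `tok2vec` or `transformer` component, which the listener sections added further down in the template rely on.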
@ -54,7 +55,7 @@ stride = 96
factory = "morphologizer" factory = "morphologizer"
[components.morphologizer.model] [components.morphologizer.model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
nO = null nO = null
[components.morphologizer.model.tok2vec] [components.morphologizer.model.tok2vec]
@ -70,7 +71,7 @@ grad_factor = 1.0
factory = "tagger" factory = "tagger"
[components.tagger.model] [components.tagger.model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
nO = null nO = null
[components.tagger.model.tok2vec] [components.tagger.model.tok2vec]
@ -123,6 +124,60 @@ grad_factor = 1.0
@layers = "reduce_mean.v1" @layers = "reduce_mean.v1"
{% endif -%} {% endif -%}
{% if "spancat" in components -%}
[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.spancat.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]
{% endif -%}
{% if "trainable_lemmatizer" in components -%}
[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
top_k = 1
[components.trainable_lemmatizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false
[components.trainable_lemmatizer.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.trainable_lemmatizer.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
{% endif -%}
{% if "entity_linker" in components -%} {% if "entity_linker" in components -%}
[components.entity_linker] [components.entity_linker]
factory = "entity_linker" factory = "entity_linker"
@ -131,7 +186,7 @@ incl_context = true
incl_prior = true incl_prior = true
[components.entity_linker.model] [components.entity_linker.model]
@architectures = "spacy.EntityLinker.v1" @architectures = "spacy.EntityLinker.v2"
nO = null nO = null
[components.entity_linker.model.tok2vec] [components.entity_linker.model.tok2vec]
@ -238,7 +293,7 @@ maxout_pieces = 3
factory = "morphologizer" factory = "morphologizer"
[components.morphologizer.model] [components.morphologizer.model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
nO = null nO = null
[components.morphologizer.model.tok2vec] [components.morphologizer.model.tok2vec]
@ -251,7 +306,7 @@ width = ${components.tok2vec.model.encode.width}
factory = "tagger" factory = "tagger"
[components.tagger.model] [components.tagger.model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
nO = null nO = null
[components.tagger.model.tok2vec] [components.tagger.model.tok2vec]
@ -295,6 +350,54 @@ nO = null
width = ${components.tok2vec.model.encode.width} width = ${components.tok2vec.model.encode.width}
{% endif %} {% endif %}
{% if "spancat" in components %}
[components.spancat]
factory = "spancat"
max_positive = null
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
spans_key = "sc"
threshold = 0.5
[components.spancat.model]
@architectures = "spacy.SpanCategorizer.v1"
[components.spancat.model.reducer]
@layers = "spacy.mean_max_reducer.v1"
hidden_size = 128
[components.spancat.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = null
nI = null
[components.spancat.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[components.spancat.suggester]
@misc = "spacy.ngram_suggester.v1"
sizes = [1,2,3]
{% endif %}
{% if "trainable_lemmatizer" in components -%}
[components.trainable_lemmatizer]
factory = "trainable_lemmatizer"
backoff = "orth"
min_tree_freq = 3
overwrite = false
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
top_k = 1
[components.trainable_lemmatizer.model]
@architectures = "spacy.Tagger.v2"
nO = null
normalize = false
[components.trainable_lemmatizer.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
{% endif -%}
{% if "entity_linker" in components -%} {% if "entity_linker" in components -%}
[components.entity_linker] [components.entity_linker]
factory = "entity_linker" factory = "entity_linker"
@ -303,7 +406,7 @@ incl_context = true
incl_prior = true incl_prior = true
[components.entity_linker.model] [components.entity_linker.model]
@architectures = "spacy.EntityLinker.v1" @architectures = "spacy.EntityLinker.v2"
nO = null nO = null
[components.entity_linker.model.tok2vec] [components.entity_linker.model.tok2vec]
@ -369,7 +472,7 @@ no_output_layer = false
{% endif %} {% endif %}
{% for pipe in components %} {% for pipe in components %}
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %} {% if pipe not in listener_components %}
{# Other components defined by the user: we just assume they're factories #} {# Other components defined by the user: we just assume they're factories #}
[components.{{ pipe }}] [components.{{ pipe }}]
factory = "{{ pipe }}" factory = "{{ pipe }}"

View File

@ -7,7 +7,7 @@ USAGE: https://spacy.io/usage/visualizers
from typing import Union, Iterable, Optional, Dict, Any, Callable from typing import Union, Iterable, Optional, Dict, Any, Callable
import warnings import warnings
from .render import DependencyRenderer, EntityRenderer from .render import DependencyRenderer, EntityRenderer, SpanRenderer
from ..tokens import Doc, Span from ..tokens import Doc, Span
from ..errors import Errors, Warnings from ..errors import Errors, Warnings
from ..util import is_in_jupyter from ..util import is_in_jupyter
@ -44,6 +44,7 @@ def render(
factories = { factories = {
"dep": (DependencyRenderer, parse_deps), "dep": (DependencyRenderer, parse_deps),
"ent": (EntityRenderer, parse_ents), "ent": (EntityRenderer, parse_ents),
"span": (SpanRenderer, parse_spans),
} }
if style not in factories: if style not in factories:
raise ValueError(Errors.E087.format(style=style)) raise ValueError(Errors.E087.format(style=style))
@ -55,6 +56,10 @@ def render(
renderer_func, converter = factories[style] renderer_func, converter = factories[style]
renderer = renderer_func(options=options) renderer = renderer_func(options=options)
parsed = [converter(doc, options) for doc in docs] if not manual else docs # type: ignore parsed = [converter(doc, options) for doc in docs] if not manual else docs # type: ignore
if manual:
for doc in docs:
if isinstance(doc, dict) and "ents" in doc:
doc["ents"] = sorted(doc["ents"], key=lambda x: (x["start"], x["end"]))
_html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() # type: ignore _html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() # type: ignore
html = _html["parsed"] html = _html["parsed"]
if RENDER_WRAPPER is not None: if RENDER_WRAPPER is not None:
@ -203,6 +208,42 @@ def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
return {"text": doc.text, "ents": ents, "title": title, "settings": settings} return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
def parse_spans(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
"""Generate spans in [{start: i, end: i, label: 'label'}] format.
doc (Doc): Document to parse.
options (Dict[str, Any]): Span-specific visualisation options.
RETURNS (dict): Generated span types keyed by text (original text) and spans.
"""
kb_url_template = options.get("kb_url_template", None)
spans_key = options.get("spans_key", "sc")
spans = [
{
"start": span.start_char,
"end": span.end_char,
"start_token": span.start,
"end_token": span.end,
"label": span.label_,
"kb_id": span.kb_id_ if span.kb_id_ else "",
"kb_url": kb_url_template.format(span.kb_id_) if kb_url_template else "#",
}
for span in doc.spans[spans_key]
]
tokens = [token.text for token in doc]
if not spans:
warnings.warn(Warnings.W117.format(spans_key=spans_key))
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
settings = get_doc_settings(doc)
return {
"text": doc.text,
"spans": spans,
"title": title,
"settings": settings,
"tokens": tokens,
}
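
To make the shape of the dict returned by `parse_spans` concrete, here is a sketch that builds an equivalent structure from plain tokens and token-level spans. It assumes tokens are separated by single spaces, which real `Doc` objects do not guarantee, and `spans_to_displacy` is a hypothetical helper rather than part of displaCy.

```python
from typing import Any, Dict, List, Tuple


def spans_to_displacy(
    tokens: List[str], spans: List[Tuple[int, int, str]]
) -> Dict[str, Any]:
    """Build a parse_spans-style dict from (start_token, end_token, label) spans."""
    # Character offset at which each token starts in the joined text.
    starts = []
    offset = 0
    for tok in tokens:
        starts.append(offset)
        offset += len(tok) + 1
    parsed = []
    for start_token, end_token, label in spans:
        last = end_token - 1  # end_token is exclusive
        parsed.append(
            {
                "start": starts[start_token],
                "end": starts[last] + len(tokens[last]),
                "start_token": start_token,
                "end_token": end_token,
                "label": label,
                "kb_id": "",
                "kb_url": "#",
            }
        )
    return {
        "text": " ".join(tokens),
        "spans": parsed,
        "tokens": tokens,
        "title": None,
        "settings": {},
    }


tokens = ["Welcome", "to", "the", "Bank", "of", "China", "."]
parsed = spans_to_displacy(tokens, [(3, 6, "ORG"), (5, 6, "GPE")])
print(parsed["spans"][0])
```

Unlike entities, the two spans here overlap on the token "China", which is the case the new `span` style is meant to visualise.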
def set_render_wrapper(func: Callable[[str], str]) -> None: def set_render_wrapper(func: Callable[[str], str]) -> None:
"""Set an optional wrapper function that is called around the generated """Set an optional wrapper function that is called around the generated
HTML markup on displacy.render. This can be used to allow integration into HTML markup on displacy.render. This can be used to allow integration into

View File

@ -1,12 +1,15 @@
from typing import Dict, Any, List, Optional, Union from typing import Any, Dict, List, Optional, Tuple, Union
import uuid import uuid
import itertools
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
from .templates import TPL_ENTS, TPL_KB_LINK
from ..util import minify_html, escape_html, registry
from ..errors import Errors from ..errors import Errors
from ..util import escape_html, minify_html, registry
from .templates import TPL_DEP_ARCS, TPL_DEP_SVG, TPL_DEP_WORDS
from .templates import TPL_DEP_WORDS_LEMMA, TPL_ENT, TPL_ENT_RTL, TPL_ENTS
from .templates import TPL_FIGURE, TPL_KB_LINK, TPL_PAGE, TPL_SPAN
from .templates import TPL_SPAN_RTL, TPL_SPAN_SLICE, TPL_SPAN_SLICE_RTL
from .templates import TPL_SPAN_START, TPL_SPAN_START_RTL, TPL_SPANS
from .templates import TPL_TITLE
DEFAULT_LANG = "en" DEFAULT_LANG = "en"
DEFAULT_DIR = "ltr" DEFAULT_DIR = "ltr"
@ -33,6 +36,168 @@ DEFAULT_LABEL_COLORS = {
} }
class SpanRenderer:
"""Render Spans as HTML."""
style = "span"
def __init__(self, options: Dict[str, Any] = {}) -> None:
"""Initialise span renderer
options (dict): Visualiser-specific options (colors, spans)
"""
# Set up the colors and overall look
colors = dict(DEFAULT_LABEL_COLORS)
user_colors = registry.displacy_colors.get_all()
for user_color in user_colors.values():
if callable(user_color):
# Since this comes from the function registry, we want to make
# sure we support functions that *return* a dict of colors
user_color = user_color()
if not isinstance(user_color, dict):
raise ValueError(Errors.E925.format(obj=type(user_color)))
colors.update(user_color)
colors.update(options.get("colors", {}))
self.default_color = DEFAULT_ENTITY_COLOR
self.colors = {label.upper(): color for label, color in colors.items()}
# Set up how the text and labels will be rendered
self.direction = DEFAULT_DIR
self.lang = DEFAULT_LANG
self.top_offset = options.get("top_offset", 40)
self.top_offset_step = options.get("top_offset_step", 17)
# Set up which templates will be used
template = options.get("template")
if template:
self.span_template = template["span"]
self.span_slice_template = template["slice"]
self.span_start_template = template["start"]
else:
if self.direction == "rtl":
self.span_template = TPL_SPAN_RTL
self.span_slice_template = TPL_SPAN_SLICE_RTL
self.span_start_template = TPL_SPAN_START_RTL
else:
self.span_template = TPL_SPAN
self.span_slice_template = TPL_SPAN_SLICE
self.span_start_template = TPL_SPAN_START
def render(
self, parsed: List[Dict[str, Any]], page: bool = False, minify: bool = False
) -> str:
"""Render complete markup.
parsed (list): Span parses to render.
page (bool): Render parses wrapped as full HTML page.
minify (bool): Minify HTML markup.
RETURNS (str): Rendered HTML markup.
"""
rendered = []
for i, p in enumerate(parsed):
if i == 0:
settings = p.get("settings", {})
self.direction = settings.get("direction", DEFAULT_DIR)
self.lang = settings.get("lang", DEFAULT_LANG)
rendered.append(self.render_spans(p["tokens"], p["spans"], p.get("title")))
if page:
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
else:
markup = "".join(rendered)
if minify:
return minify_html(markup)
return markup
def render_spans(
self,
tokens: List[str],
spans: List[Dict[str, Any]],
title: Optional[str],
) -> str:
"""Render span types in text.
Spans are rendered per-token: for each token, we check if it's part
of a span slice (a member of a span type) or a span start (the starting token of a
given span type).
tokens (list): Individual tokens in the text
spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
title (str / None): Document title set in Doc.user_data['title'].
"""
per_token_info = []
for idx, token in enumerate(tokens):
# Identify if a token belongs to a Span (and which) and if it's a
# start token of said Span. We'll use this for the final HTML render
token_markup: Dict[str, Any] = {}
token_markup["text"] = token
entities = []
for span in spans:
ent = {}
if span["start_token"] <= idx < span["end_token"]:
ent["label"] = span["label"]
ent["is_start"] = idx == span["start_token"]
kb_id = span.get("kb_id", "")
kb_url = span.get("kb_url", "#")
ent["kb_link"] = (
TPL_KB_LINK.format(kb_id=kb_id, kb_url=kb_url) if kb_id else ""
)
entities.append(ent)
token_markup["entities"] = entities
per_token_info.append(token_markup)
markup = self._render_markup(per_token_info)
markup = TPL_SPANS.format(content=markup, dir=self.direction)
if title:
markup = TPL_TITLE.format(title=title) + markup
return markup
def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
"""Render the markup from per-token information"""
markup = ""
for token in per_token_info:
entities = sorted(token["entities"], key=lambda d: d["label"])
if entities:
slices = self._get_span_slices(token["entities"])
starts = self._get_span_starts(token["entities"])
markup += self.span_template.format(
text=token["text"], span_slices=slices, span_starts=starts
)
else:
markup += escape_html(token["text"] + " ")
return markup
def _get_span_slices(self, entities: List[Dict]) -> str:
"""Get the rendered markup of all Span slices"""
span_slices = []
for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
color = self.colors.get(entity["label"].upper(), self.default_color)
span_slice = self.span_slice_template.format(
bg=color, top_offset=self.top_offset + step
)
span_slices.append(span_slice)
return "".join(span_slices)
def _get_span_starts(self, entities: List[Dict]) -> str:
"""Get the rendered markup of all Span start tokens"""
span_starts = []
for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
color = self.colors.get(entity["label"].upper(), self.default_color)
span_start = (
self.span_start_template.format(
bg=color,
top_offset=self.top_offset + step,
label=entity["label"],
kb_link=entity["kb_link"],
)
if entity["is_start"]
else ""
)
span_starts.append(span_start)
return "".join(span_starts)
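
The per-token bookkeeping in `render_spans` and the vertical stacking in `_get_span_slices` can be seen without any HTML. The following is a simplified sketch that uses the defaults from the diff (`top_offset=40`, `top_offset_step=17`); the function names are illustrative and the KB link handling is left out.

```python
import itertools
from typing import Any, Dict, List


def per_token_info(
    tokens: List[str], spans: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """For each token, list the spans covering it and whether it starts them."""
    info = []
    for idx, token in enumerate(tokens):
        entities = [
            {"label": span["label"], "is_start": idx == span["start_token"]}
            for span in spans
            if span["start_token"] <= idx < span["end_token"]
        ]
        info.append({"text": token, "entities": entities})
    return info


def slice_offsets(n_entities: int, top_offset: int = 40, step: int = 17) -> List[int]:
    """Vertical offset in px of each stacked span slice under one token."""
    return [
        top_offset + s
        for _, s in zip(range(n_entities), itertools.count(step=step))
    ]


tokens = ["Bank", "of", "China"]
spans = [
    {"start_token": 0, "end_token": 3, "label": "ORG"},
    {"start_token": 2, "end_token": 3, "label": "GPE"},
]
info = per_token_info(tokens, spans)
print(info[2]["entities"])
print(slice_offsets(len(info[2]["entities"])))
```

Each additional span covering a token is drawn one step lower, so overlapping spans appear as stacked underlines beneath that token.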
class DependencyRenderer: class DependencyRenderer:
"""Render dependency parses as SVGs.""" """Render dependency parses as SVGs."""
@ -105,7 +270,7 @@ class DependencyRenderer:
RETURNS (str): Rendered SVG markup. RETURNS (str): Rendered SVG markup.
""" """
self.levels = self.get_levels(arcs) self.levels = self.get_levels(arcs)
self.highest_level = len(self.levels) self.highest_level = max(self.levels.values(), default=0)
self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke
self.width = self.offset_x + len(words) * self.distance self.width = self.offset_x + len(words) * self.distance
self.height = self.offset_y + 3 * self.word_spacing self.height = self.offset_y + 3 * self.word_spacing
@ -165,7 +330,7 @@ class DependencyRenderer:
if start < 0 or end < 0: if start < 0 or end < 0:
error_args = dict(start=start, end=end, label=label, dir=direction) error_args = dict(start=start, end=end, label=label, dir=direction)
raise ValueError(Errors.E157.format(**error_args)) raise ValueError(Errors.E157.format(**error_args))
level = self.levels.index(end - start) + 1 level = self.levels[(start, end, label)]
x_start = self.offset_x + start * self.distance + self.arrow_spacing x_start = self.offset_x + start * self.distance + self.arrow_spacing
if self.direction == "rtl": if self.direction == "rtl":
x_start = self.width - x_start x_start = self.width - x_start
@ -181,7 +346,7 @@ class DependencyRenderer:
y_curve = self.offset_y - level * self.distance / 2 y_curve = self.offset_y - level * self.distance / 2
if self.compact: if self.compact:
y_curve = self.offset_y - level * self.distance / 6 y_curve = self.offset_y - level * self.distance / 6
if y_curve == 0 and len(self.levels) > 5: if y_curve == 0 and max(self.levels.values(), default=0) > 5:
y_curve = -self.distance y_curve = -self.distance
arrowhead = self.get_arrowhead(direction, x_start, y, x_end) arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
arc = self.get_arc(x_start, y, y_curve, x_end) arc = self.get_arc(x_start, y, y_curve, x_end)
@ -225,15 +390,23 @@ class DependencyRenderer:
p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2) p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2)
return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}" return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}"
def get_levels(self, arcs: List[Dict[str, Any]]) -> List[int]: def get_levels(self, arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
"""Calculate available arc height "levels". """Calculate available arc height "levels".
Used to calculate arrow heights dynamically and without wasting space. Used to calculate arrow heights dynamically and without wasting space.
args (list): Individual arcs and their start, end, direction and label. args (list): Individual arcs and their start, end, direction and label.
RETURNS (list): Arc levels sorted from lowest to highest. RETURNS (dict): Arc levels keyed by (start, end, label).
""" """
levels = set(map(lambda arc: arc["end"] - arc["start"], arcs)) arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
return sorted(list(levels)) length = max([arc["end"] for arc in arcs], default=0)
max_level = [0] * length
levels = {}
for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
level = max(max_level[arc["start"] : arc["end"]]) + 1
for i in range(arc["start"], arc["end"]):
max_level[i] = level
levels[(arc["start"], arc["end"], arc["label"])] = level
return levels
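
The new `get_levels` is easier to follow on a small input. The sketch below copies the algorithm from the diff into a standalone function and runs it on four made-up arcs; the arc data is illustrative only.

```python
from typing import Any, Dict, List, Tuple


def get_levels(arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
    # Drop exact duplicate arcs.
    arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
    length = max([arc["end"] for arc in arcs], default=0)
    # Highest level used so far above each gap between tokens.
    max_level = [0] * length
    levels = {}
    # Shorter arcs are placed first, so longer arcs stack on top of them.
    for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
        level = max(max_level[arc["start"] : arc["end"]]) + 1
        for i in range(arc["start"], arc["end"]):
            max_level[i] = level
        levels[(arc["start"], arc["end"], arc["label"])] = level
    return levels


arcs = [
    {"start": 0, "end": 1, "label": "det", "dir": "left"},
    {"start": 1, "end": 2, "label": "nsubj", "dir": "left"},
    {"start": 2, "end": 4, "label": "dobj", "dir": "right"},
    {"start": 3, "end": 4, "label": "det", "dir": "left"},
]
levels = get_levels(arcs)
print(levels[(2, 4, "dobj")])
```

An arc is only raised when it actually spans a shorter arc: a long arc with nothing underneath stays at level 1, whereas the previous length-based list would have assigned it a level by its length rank alone.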
class EntityRenderer: class EntityRenderer:
@ -242,7 +415,7 @@ class EntityRenderer:
style = "ent" style = "ent"
def __init__(self, options: Dict[str, Any] = {}) -> None: def __init__(self, options: Dict[str, Any] = {}) -> None:
"""Initialise dependency renderer. """Initialise entity renderer.
options (dict): Visualiser-specific options (colors, ents) options (dict): Visualiser-specific options (colors, ents)
""" """

View File

@ -62,6 +62,55 @@ TPL_ENT_RTL = """
</mark> </mark>
""" """
TPL_SPANS = """
<div class="spans" style="line-height: 2.5; direction: {dir}">{content}</div>
"""
TPL_SPAN = """
<span style="font-weight: bold; display: inline-block; position: relative;">
{text}
{span_slices}
{span_starts}
</span>
"""
TPL_SPAN_SLICE = """
<span style="background: {bg}; top: {top_offset}px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;">
</span>
"""
TPL_SPAN_START = """
<span style="background: {bg}; top: {top_offset}px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;">
<span style="background: {bg}; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px">
{label}{kb_link}
</span>
</span>
"""
TPL_SPAN_RTL = """
<span style="font-weight: bold; display: inline-block; position: relative;">
{text}
{span_slices}
{span_starts}
</span>
"""
TPL_SPAN_SLICE_RTL = """
<span style="background: {bg}; top: {top_offset}px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;">
</span>
"""
TPL_SPAN_START_RTL = """
<span style="background: {bg}; top: {top_offset}px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;">
<span style="background: {bg}; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px">
{label}{kb_link}
</span>
</span>
"""
# Important: this needs to start with a space! # Important: this needs to start with a space!
TPL_KB_LINK = """ TPL_KB_LINK = """
<a style="text-decoration: none; color: inherit; font-weight: normal" href="{kb_url}">{kb_id}</a> <a style="text-decoration: none; color: inherit; font-weight: normal" href="{kb_url}">{kb_id}</a>

View File

@ -192,6 +192,13 @@ class Warnings(metaclass=ErrorsWithCodes):
W115 = ("Skipping {method}: the floret vector table cannot be modified. " W115 = ("Skipping {method}: the floret vector table cannot be modified. "
"Vectors are calculated from character ngrams.") "Vectors are calculated from character ngrams.")
W116 = ("Unable to clean attribute '{attr}'.") W116 = ("Unable to clean attribute '{attr}'.")
W117 = ("No spans to visualize found in Doc object with spans_key: '{spans_key}'. If this is "
"surprising to you, make sure the Doc was processed using a model "
"that supports span categorization, and check the `doc.spans[spans_key]` "
"property manually if necessary.")
W118 = ("Term '{term}' not found in glossary. It may however be explained in documentation "
"for the corpora used to train the language. Please check "
"`nlp.meta[\"sources\"]` for any relevant links.")
class Errors(metaclass=ErrorsWithCodes): class Errors(metaclass=ErrorsWithCodes):
@ -483,7 +490,7 @@ class Errors(metaclass=ErrorsWithCodes):
"components, since spans are only views of the Doc. Use Doc and " "components, since spans are only views of the Doc. Use Doc and "
"Token attributes (or custom extension attributes) only and remove " "Token attributes (or custom extension attributes) only and remove "
"the following: {attrs}") "the following: {attrs}")
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. " E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
"Only Doc and Token attributes are supported.") "Only Doc and Token attributes are supported.")
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget " E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
"to define the attribute? For example: `{attr}.???`") "to define the attribute? For example: `{attr}.???`")
@ -520,10 +527,14 @@ class Errors(metaclass=ErrorsWithCodes):
E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.") E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
# New errors added in v3.x # New errors added in v3.x
E855 = ("Invalid {obj}: {obj} is not from the same doc.")
E856 = ("Error accessing span at position {i}: out of bounds in span group "
"of length {length}.")
E857 = ("Entry '{name}' not found in edit tree lemmatizer labels.")
E858 = ("The {mode} vector table does not support this operation. " E858 = ("The {mode} vector table does not support this operation. "
"{alternative}") "{alternative}")
E859 = ("The floret vector table cannot be modified.") E859 = ("The floret vector table cannot be modified.")
E860 = ("Can't truncate fasttext-bloom vectors.") E860 = ("Can't truncate floret vectors.")
E861 = ("No 'keys' should be provided when initializing floret vectors " E861 = ("No 'keys' should be provided when initializing floret vectors "
"with 'minn' and 'maxn'.") "with 'minn' and 'maxn'.")
E862 = ("'hash_count' must be between 1-4 for floret vectors.") E862 = ("'hash_count' must be between 1-4 for floret vectors.")
@ -566,9 +577,6 @@ class Errors(metaclass=ErrorsWithCodes):
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to " E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
"a list of spans, with each span represented by a tuple (start_char, end_char). " "a list of spans, with each span represented by a tuple (start_char, end_char). "
"The tuple can be optionally extended with a label and a KB ID.") "The tuple can be optionally extended with a label and a KB ID.")
E880 = ("The 'wandb' library could not be found - did you install it? "
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
"config section, instead of the 'WandbLogger'.")
E884 = ("The pipeline could not be initialized because the vectors " E884 = ("The pipeline could not be initialized because the vectors "
"could not be found at '{vectors}'. If your pipeline was already " "could not be found at '{vectors}'. If your pipeline was already "
"initialized/trained before, call 'resume_training' instead of 'initialize', " "initialized/trained before, call 'resume_training' instead of 'initialize', "
@ -894,6 +902,17 @@ class Errors(metaclass=ErrorsWithCodes):
"patterns.") "patterns.")
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only " E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
"supported values are: 'I', 'O', 'B' and ''") "supported values are: 'I', 'O', 'B' and ''")
E1026 = ("Edit tree has an invalid format:\n{errors}")
E1027 = ("AlignmentArray only supports slicing with a step of 1.")
E1028 = ("AlignmentArray only supports indexing using an int or a slice.")
E1029 = ("Edit tree cannot be applied to form.")
E1030 = ("Edit tree identifier out of range.")
E1031 = ("Could not find gold transition - see logs above.")
E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
E1034 = ("Node index {i} out of bounds ({length})")
E1035 = ("Token index {i} out of bounds ({length})")
E1036 = ("Cannot index into NoneNode")
# Deprecated model shortcuts, only used in errors and warnings # Deprecated model shortcuts, only used in errors and warnings


@ -1,3 +1,7 @@
import warnings
from .errors import Warnings
def explain(term): def explain(term):
"""Get a description for a given POS tag, dependency label or entity type. """Get a description for a given POS tag, dependency label or entity type.
@ -11,6 +15,8 @@ def explain(term):
""" """
if term in GLOSSARY: if term in GLOSSARY:
return GLOSSARY[term] return GLOSSARY[term]
else:
warnings.warn(Warnings.W118.format(term=term))
GLOSSARY = { GLOSSARY = {
@ -310,7 +316,6 @@ GLOSSARY = {
"re": "repeated element", "re": "repeated element",
"rs": "reported speech", "rs": "reported speech",
"sb": "subject", "sb": "subject",
"sb": "subject",
"sbp": "passivized subject (PP)", "sbp": "passivized subject (PP)",
"sp": "subject or predicate", "sp": "subject or predicate",
"svp": "separable verb prefix", "svp": "separable verb prefix",
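Note on the `explain` hunk above: a term missing from the glossary now triggers a warning instead of silently returning `None`. A minimal standalone sketch of that behaviour, assuming a tiny stand-in glossary and a hypothetical message string (the real wording lives in `Warnings.W118`, which this hunk does not show):

```python
import warnings

# Stand-in glossary; the real GLOSSARY in the diff is much larger.
GLOSSARY = {"sb": "subject", "svp": "separable verb prefix"}

# Hypothetical message; the actual W118 text is not part of this hunk.
UNKNOWN_TERM_MSG = "Term '{term}' not found in glossary."


def explain(term):
    """Return the description for a term, or warn and return None."""
    if term in GLOSSARY:
        return GLOSSARY[term]
    # Unknown terms are reported rather than ignored.
    warnings.warn(UNKNOWN_TERM_MSG.format(term=term))
    return None
```

Callers that relied on the silent `None` still receive `None`; they only see an additional warning.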


@ -0,0 +1,16 @@
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
from ...language import Language, BaseDefaults
class LowerSorbianDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
class LowerSorbian(Language):
lang = "dsb"
Defaults = LowerSorbianDefaults
__all__ = ["LowerSorbian"]


@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.dsb.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Z tym stwori so wuměnjenje a zakład za dalše wobdźěłanje přez analyzu tekstoweje struktury a semantisku anotaciju a z tym tež za tu předstajenu digitalnu online-wersiju.",
"Mi so tu jara derje spodoba.",
"Kotre nowniny chceće měć?",
"Tak ako w slědnem lěśe jo teke lětosa jano doma zapustowaś móžno.",
"Zwóstanjo pótakem hyšći wjele źěła.",
]

113
spacy/lang/dsb/lex_attrs.py Normal file

@ -0,0 +1,113 @@
from ...attrs import LIKE_NUM
_num_words = [
"nul",
"jaden",
"jadna",
"jadno",
"dwa",
"dwě",
"tśi",
"tśo",
"styri",
"styrjo",
"pěś",
"pěśo",
"šesć",
"šesćo",
"sedym",
"sedymjo",
"wósym",
"wósymjo",
"źewjeś",
"źewjeśo",
"źaseś",
"źaseśo",
"jadnassćo",
"dwanassćo",
"tśinasćo",
"styrnasćo",
"pěśnasćo",
"šesnasćo",
"sedymnasćo",
"wósymnasćo",
"źewjeśnasćo",
"dwanasćo",
"dwaźasća",
"tśiźasća",
"styrźasća",
"pěśźaset",
"šesćźaset",
"sedymźaset",
"wósymźaset",
"źewjeśźaset",
"sto",
"tysac",
"milion",
"miliarda",
"bilion",
"biliarda",
"trilion",
"triliarda",
]
_ordinal_words = [
"prědny",
"prědna",
"prědne",
"drugi",
"druga",
"druge",
"tśeśi",
"tśeśa",
"tśeśe",
"stwórty",
"stwórta",
"stwórte",
"pěty",
"pěta",
"pěte",
"šesty",
"šesta",
"šeste",
"sedymy",
"sedyma",
"sedyme",
"wósymy",
"wósyma",
"wósyme",
"źewjety",
"źewjeta",
"źewjete",
"źasety",
"źaseta",
"źasete",
"jadnasty",
"jadnasta",
"jadnaste",
"dwanasty",
"dwanasta",
"dwanaste",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}
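Note on `like_num` above: the checks run in order (strip one leading sign, drop separators, test for digits, test for a simple fraction, then fall back to the word lists). A self-contained sketch of the same logic, assuming abbreviated word lists in place of the full ones from the diff:

```python
# Abbreviated word lists; the full lists are in the diff above.
_num_words = ["nul", "jaden", "dwa", "tśi", "sto"]
_ordinal_words = ["prědny", "drugi", "tśeśi"]


def like_num(text):
    # Drop one leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as "3/4".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    # Cardinal and ordinal words are matched case-insensitively.
    text_lower = text.lower()
    return text_lower in _num_words or text_lower in _ordinal_words
```

Because both separators are removed before the digit test, a string such as "1,000.5" counts as number-like, while "1/2/3" does not, since it contains more than one slash.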


@ -0,0 +1,15 @@
STOP_WORDS = set(
"""
a abo aby ako ale
daniž dokulaž
gaž
jolic
pak pótom
teke togodla
""".split()
)


@ -447,7 +447,6 @@ for exc_data in [
{ORTH: "La.", NORM: "Louisiana"}, {ORTH: "La.", NORM: "Louisiana"},
{ORTH: "Mar.", NORM: "March"}, {ORTH: "Mar.", NORM: "March"},
{ORTH: "Mass.", NORM: "Massachusetts"}, {ORTH: "Mass.", NORM: "Massachusetts"},
{ORTH: "May.", NORM: "May"},
{ORTH: "Mich.", NORM: "Michigan"}, {ORTH: "Mich.", NORM: "Michigan"},
{ORTH: "Minn.", NORM: "Minnesota"}, {ORTH: "Minn.", NORM: "Minnesota"},
{ORTH: "Miss.", NORM: "Mississippi"}, {ORTH: "Miss.", NORM: "Mississippi"},


@ -9,14 +9,14 @@ Example sentences to test spaCy and its language models.
sentences = [ sentences = [
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.", "Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.", "Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
"San Francisco analiza prohibir los robots delivery.", "San Francisco analiza prohibir los robots de reparto.",
"Londres es una gran ciudad del Reino Unido.", "Londres es una gran ciudad del Reino Unido.",
"El gato come pescado.", "El gato come pescado.",
"Veo al hombre con el telescopio.", "Veo al hombre con el telescopio.",
"La araña come moscas.", "La araña come moscas.",
"El pingüino incuba en su nido sobre el hielo.", "El pingüino incuba en su nido sobre el hielo.",
"¿Dónde estais?", "¿Dónde estáis?",
"¿Quién es el presidente Francés?", "¿Quién es el presidente francés?",
"¿Dónde está encuentra la capital de Argentina?", "¿Dónde se encuentra la capital de Argentina?",
"¿Cuándo nació José de San Martín?", "¿Cuándo nació José de San Martín?",
] ]


@ -1,82 +1,80 @@
STOP_WORDS = set( STOP_WORDS = set(
""" """
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí a acuerdo adelante ademas además afirmó agregó ahi ahora ahí al algo alguna
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos algunas alguno algunos algún alli allí alrededor ambos ante anterior antes
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas apenas aproximadamente aquel aquella aquellas aquello aquellos aqui aquél
aquello aquellos aqui aquél aquélla aquéllas aquéllos aquí arriba arribaabajo aquélla aquéllas aquéllos aquí arriba aseguró asi así atras aun aunque añadió
aseguró asi así atras aun aunque ayer añadió aún aún
bajo bastante bien breve buen buena buenas bueno buenos bajo bastante bien breve buen buena buenas bueno buenos
cada casi cerca cierta ciertas cierto ciertos cinco claro comentó como con cada casi cierta ciertas cierto ciertos cinco claro comentó como con conmigo
conmigo conocer conseguimos conseguir considera consideró consigo consigue conocer conseguimos conseguir considera consideró consigo consigue consiguen
consiguen consigues contigo contra cosas creo cual cuales cualquier cuando consigues contigo contra creo cual cuales cualquier cuando cuanta cuantas
cuanta cuantas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas cuánto cuántos
cuánto cuántos cómo cómo
da dado dan dar de debajo debe deben debido decir dejó del delante demasiado da dado dan dar de debajo debe deben debido decir dejó del delante demasiado
demás dentro deprisa desde despacio despues después detras detrás dia dias dice demás dentro deprisa desde despacio despues después detras detrás dia dias dice
dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante día dicen dicho dieron diez diferente diferentes dijeron dijo dio doce donde dos
días dónde durante día días dónde
ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas e el ella ellas ello ellos embargo en encima encuentra enfrente enseguida
empleo en encima encuentra enfrente enseguida entonces entre era eramos eran entonces entre era eramos eran eras eres es esa esas ese eso esos esta estaba
eras eres es esa esas ese eso esos esta estaba estaban estado estados estais estaban estado estados estais estamos estan estar estará estas este esto estos
estamos estan estar estará estas este esto estos estoy estuvo está están ex estoy estuvo está están excepto existe existen explicó expresó él ésa ésas ése
excepto existe existen explicó expresó él ésa ésas ése ésos ésta éstas éste ésos ésta éstas éste éstos
éstos
fin final fue fuera fueron fui fuimos fin final fue fuera fueron fui fuimos
general gran grandes gueno gran grande grandes
ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
hizo horas hoy hubo hizo hoy hubo
igual incluso indicó informo informó intenta intentais intentamos intentan igual incluso indicó informo informó ir
intentar intentas intento ir
junto junto
la lado largo las le lejos les llegó lleva llevar lo los luego lugar la lado largo las le les llegó lleva llevar lo los luego
mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi
mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha mia mias mientras mio mios mis misma mismas mismo mismos modo mucha muchas
muchas mucho muchos muy más mía mías mío míos mucho muchos muy más mía mías mío míos
nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros
nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca nuestra nuestras nuestro nuestros nueva nuevas nueve nuevo nuevos nunca
ocho os otra otras otro otros o ocho once os otra otras otro otros
pais para parece parte partir pasada pasado paìs peor pero pesar poca pocas para parece parte partir pasada pasado paìs peor pero pesar poca pocas poco
poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrá pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
podrán podría podrían poner por porque posible primer primera primero primeros podrán podría podrían poner por porque posible primer primera primero primeros
principalmente pronto propia propias propio propios proximo próximo próximos pronto propia propias propio propios proximo próximo próximos pudo pueda puede
pudo pueda puede pueden puedo pues pueden puedo pues
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién quiénes qué qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién
quiénes qué
raras realizado realizar realizó repente respecto realizado realizar realizó repente respecto
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
según seis ser sera será serán sería señaló si sido siempre siendo siete sigue según seis ser sera será serán sería señaló si sido siempre siendo siete sigue
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy siguiente sin sino sobre sois sola solamente solas solo solos somos son soy su
soyos su supuesto sus suya suyas suyo sólo supuesto sus suya suyas suyo suyos sólo
tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis
tenemos tener tenga tengo tenido tenía tercera ti tiempo tiene tienen toda tenemos tener tenga tengo tenido tenía tercera tercero ti tiene tienen toda
todas todavia todavía todo todos total trabaja trabajais trabajamos trabajan todas todavia todavía todo todos total tras trata través tres tu tus tuvo tuya
trabajar trabajas trabajo tras trata través tres tu tus tuvo tuya tuyas tuyo tuyas tuyo tuyos
tuyos
ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes u ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
última últimas último últimos última últimas último últimos
va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero va vais vamos van varias varios vaya veces ver verdad verdadera verdadero vez
vez vosotras vosotros voy vuestra vuestras vuestro vuestros vosotras vosotros voy vuestra vuestras vuestro vuestros
ya yo y ya yo
""".split() """.split()
) )


@ -3,7 +3,7 @@ from ...attrs import LIKE_NUM
_num_words = set( _num_words = set(
""" """
zero un deux trois quatre cinq six sept huit neuf dix zero un une deux trois quatre cinq six sept huit neuf dix
onze douze treize quatorze quinze seize dix-sept dix-huit dix-neuf onze douze treize quatorze quinze seize dix-sept dix-huit dix-neuf
vingt trente quarante cinquante soixante soixante-dix septante quatre-vingt huitante quatre-vingt-dix nonante vingt trente quarante cinquante soixante soixante-dix septante quatre-vingt huitante quatre-vingt-dix nonante
cent mille mil million milliard billion quadrillion quintillion cent mille mil million milliard billion quadrillion quintillion
@ -13,7 +13,7 @@ sextillion septillion octillion nonillion decillion
_ordinal_words = set( _ordinal_words = set(
""" """
premier deuxième second troisième quatrième cinquième sixième septième huitième neuvième dixième premier première deuxième second seconde troisième quatrième cinquième sixième septième huitième neuvième dixième
onzième douzième treizième quatorzième quinzième seizième dix-septième dix-huitième dix-neuvième onzième douzième treizième quatorzième quinzième seizième dix-septième dix-huitième dix-neuvième
vingtième trentième quarantième cinquantième soixantième soixante-dixième septantième quatre-vingtième huitantième quatre-vingt-dixième nonantième vingtième trentième quarantième cinquantième soixantième soixante-dixième septantième quatre-vingtième huitantième quatre-vingt-dixième nonantième
centième millième millionnième milliardième billionnième quadrillionnième quintillionnième centième millième millionnième milliardième billionnième quadrillionnième quintillionnième


@ -64,9 +64,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
prev_end = right_end.i prev_end = right_end.i
left_index = word.left_edge.i left_index = word.left_edge.i
left_index = ( left_index = left_index + 1 if word.left_edge.pos == adp_pos else left_index
left_index + 1 if word.left_edge.pos == adp_pos else left_index
)
yield left_index, right_end.i + 1, np_label yield left_index, right_end.i + 1, np_label
elif word.dep == conj_label: elif word.dep == conj_label:


@ -0,0 +1,18 @@
from .lex_attrs import LEX_ATTRS
from .stop_words import STOP_WORDS
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
from ...language import Language, BaseDefaults
class UpperSorbianDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
class UpperSorbian(Language):
lang = "hsb"
Defaults = UpperSorbianDefaults
__all__ = ["UpperSorbian"]


@ -0,0 +1,15 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.hsb.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"To běšo wjelgin raźone a jo se wót luźi derje pśiwzeło. Tak som dožywiła wjelgin",
"Jogo pśewóźowarce stej groniłej, až how w serbskich stronach njama Santa Claus nic pytaś.",
"A ten sobuźěłaśeŕ Statneje biblioteki w Barlinju jo pśimjeł drogotne knigły bźez rukajcowu z nagima rukoma!",
"Take wobchadanje z našym kulturnym derbstwom zewšym njejźo.",
"Wopśimjeśe drugich pśinoskow jo było na wusokem niwowje, ako pśecej.",
]

106
spacy/lang/hsb/lex_attrs.py Normal file

@ -0,0 +1,106 @@
from ...attrs import LIKE_NUM
_num_words = [
"nul",
"jedyn",
"jedna",
"jedne",
"dwaj",
"dwě",
"tři",
"třo",
"štyri",
"štyrjo",
"pjeć",
"šěsć",
"sydom",
"wosom",
"dźewjeć",
"dźesać",
"jědnaće",
"dwanaće",
"třinaće",
"štyrnaće",
"pjatnaće",
"šěsnaće",
"sydomnaće",
"wosomnaće",
"dźewjatnaće",
"dwaceći",
"třiceći",
"štyrceći",
"pjećdźesat",
"šěsćdźesat",
"sydomdźesat",
"wosomdźesat",
"dźewjećdźesat",
"sto",
"tysac",
"milion",
"miliarda",
"bilion",
"biliarda",
"trilion",
"triliarda",
]
_ordinal_words = [
"prěni",
"prěnja",
"prěnje",
"druhi",
"druha",
"druhe",
"třeći",
"třeća",
"třeće",
"štwórty",
"štwórta",
"štwórte",
"pjaty",
"pjata",
"pjate",
"šěsty",
"šěsta",
"šěste",
"sydmy",
"sydma",
"sydme",
"wosmy",
"wosma",
"wosme",
"dźewjaty",
"dźewjata",
"dźewjate",
"dźesaty",
"dźesata",
"dźesate",
"jědnaty",
"jědnata",
"jědnate",
"dwanaty",
"dwanata",
"dwanate",
]
def like_num(text):
if text.startswith(("+", "-", "±", "~")):
text = text[1:]
text = text.replace(",", "").replace(".", "")
if text.isdigit():
return True
if text.count("/") == 1:
num, denom = text.split("/")
if num.isdigit() and denom.isdigit():
return True
text_lower = text.lower()
if text_lower in _num_words:
return True
# Check ordinal number
if text_lower in _ordinal_words:
return True
return False
LEX_ATTRS = {LIKE_NUM: like_num}


@ -0,0 +1,19 @@
STOP_WORDS = set(
"""
a abo ale ani
dokelž
hdyž
jeli jelizo
kaž
pak potom
tež tohodla
zo zoby
""".split()
)


@ -0,0 +1,18 @@
from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, NORM
from ...util import update_exc
_exc = dict()
for exc_data in [
{ORTH: "mil.", NORM: "milion"},
{ORTH: "wob.", NORM: "wobydler"},
]:
_exc[exc_data[ORTH]] = [exc_data]
for orth in [
"resp.",
]:
_exc[orth] = [{ORTH: orth}]
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)


@ -1,12 +1,13 @@
from typing import Iterator, Any, Dict from typing import Iterator, Any, Dict
from .punctuation import TOKENIZER_INFIXES
from .stop_words import STOP_WORDS from .stop_words import STOP_WORDS
from .tag_map import TAG_MAP from .tag_map import TAG_MAP
from .lex_attrs import LEX_ATTRS from .lex_attrs import LEX_ATTRS
from ...language import Language, BaseDefaults from ...language import Language, BaseDefaults
from ...tokens import Doc from ...tokens import Doc
from ...scorer import Scorer from ...scorer import Scorer
from ...symbols import POS from ...symbols import POS, X
from ...training import validate_examples from ...training import validate_examples
from ...util import DummyTokenizer, registry, load_config_from_str from ...util import DummyTokenizer, registry, load_config_from_str
from ...vocab import Vocab from ...vocab import Vocab
@ -31,15 +32,24 @@ def create_tokenizer():
class KoreanTokenizer(DummyTokenizer): class KoreanTokenizer(DummyTokenizer):
def __init__(self, vocab: Vocab): def __init__(self, vocab: Vocab):
self.vocab = vocab self.vocab = vocab
MeCab = try_mecab_import() # type: ignore[func-returns-value] self._mecab = try_mecab_import() # type: ignore[func-returns-value]
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]") self._mecab_tokenizer = None
@property
def mecab_tokenizer(self):
# This is a property so that initializing a pipeline with blank:ko is
# possible without actually requiring mecab-ko, e.g. to run
# `spacy init vectors ko` for a pipeline that will have a different
# tokenizer in the end. The languages need to match for the vectors
# to be imported and there's no way to pass a custom config to
# `init vectors`.
if self._mecab_tokenizer is None:
self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
return self._mecab_tokenizer
def __reduce__(self): def __reduce__(self):
return KoreanTokenizer, (self.vocab,) return KoreanTokenizer, (self.vocab,)
def __del__(self):
self.mecab_tokenizer.__del__()
def __call__(self, text: str) -> Doc: def __call__(self, text: str) -> Doc:
dtokens = list(self.detailed_tokens(text)) dtokens = list(self.detailed_tokens(text))
surfaces = [dt["surface"] for dt in dtokens] surfaces = [dt["surface"] for dt in dtokens]
@ -47,7 +57,10 @@ class KoreanTokenizer(DummyTokenizer):
for token, dtoken in zip(doc, dtokens): for token, dtoken in zip(doc, dtokens):
first_tag, sep, eomi_tags = dtoken["tag"].partition("+") first_tag, sep, eomi_tags = dtoken["tag"].partition("+")
token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미) token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미)
if token.tag_ in TAG_MAP:
token.pos = TAG_MAP[token.tag_][POS] token.pos = TAG_MAP[token.tag_][POS]
else:
token.pos = X
token.lemma_ = dtoken["lemma"] token.lemma_ = dtoken["lemma"]
doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens] doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
return doc return doc
@ -76,6 +89,7 @@ class KoreanDefaults(BaseDefaults):
lex_attr_getters = LEX_ATTRS lex_attr_getters = LEX_ATTRS
stop_words = STOP_WORDS stop_words = STOP_WORDS
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False} writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
infixes = TOKENIZER_INFIXES
class Korean(Language): class Korean(Language):
@ -90,7 +104,8 @@ def try_mecab_import() -> None:
return MeCab return MeCab
except ImportError: except ImportError:
raise ImportError( raise ImportError(
"Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), " 'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), " "[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
"and [natto-py](https://github.com/buruzaemon/natto-py)" "and [natto-py](https://github.com/buruzaemon/natto-py)"
) from None ) from None
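Note on the `mecab_tokenizer` property above: building the tokenizer object is deferred until first access, so a pipeline object can exist without the backend ever being constructed. A minimal sketch of that lazy-initialization pattern, assuming a hypothetical `LazyTokenizer` class and a fake backend factory standing in for MeCab:

```python
class LazyTokenizer:
    """Defer building an expensive backend until it is first used."""

    def __init__(self, backend_factory):
        # backend_factory is a hypothetical callable standing in for MeCab.
        self._backend_factory = backend_factory
        self._backend = None

    @property
    def backend(self):
        # Build the backend on first access, then reuse the same object.
        if self._backend is None:
            self._backend = self._backend_factory("-F%f[0],%f[7]")
        return self._backend


calls = []


def fake_backend(options):
    # Record each construction so the laziness is observable.
    calls.append(options)
    return object()


tok = LazyTokenizer(fake_backend)
```

Creating `tok` does not call the factory; the first read of `tok.backend` does, and later reads return the cached object.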


@ -0,0 +1,12 @@
from ..char_classes import LIST_QUOTES
from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES
_infixes = (
["·", "ㆍ", "\(", "\)"]
+ [r"(?<=[0-9])~(?=[0-9-])"]
+ LIST_QUOTES
+ BASE_TOKENIZER_INFIXES
)
TOKENIZER_INFIXES = _infixes
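Note on the infix list above: the pattern `(?<=[0-9])~(?=[0-9-])` only matches a tilde that follows a digit and precedes a digit or a hyphen, which is how numeric ranges are written. A rough sketch of where it matches, using plain `re` and a hypothetical helper `split_on_range_tilde` (the spaCy tokenizer applies infix patterns through its own machinery, so this only illustrates the regular expression itself):

```python
import re

# Pattern copied from the diff: a tilde between a digit and a digit or hyphen.
RANGE_TILDE = re.compile(r"(?<=[0-9])~(?=[0-9-])")


def split_on_range_tilde(text):
    """Split text around every tilde matched by RANGE_TILDE."""
    pieces = []
    start = 0
    for match in RANGE_TILDE.finditer(text):
        pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    pieces.append(text[start:])
    # Drop empty strings that appear when a match sits at a boundary.
    return [p for p in pieces if p]
```

A tilde between letters, or one at the start of the string, is left alone because the lookbehind requires a preceding digit.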


@ -1,56 +1,219 @@
from ...attrs import LIKE_NUM from ...attrs import LIKE_NUM
_num_words = [ _num_words = list(
"ноль", set(
"один", """
"два", ноль ноля нолю нолём ноле нулевой нулевого нулевому нулевым нулевом нулевая нулевую нулевое нулевые нулевых нулевыми
"три",
"четыре", четверть четверти четвертью четвертей четвертям четвертями четвертях
"пять",
"шесть", треть трети третью третей третям третями третях
"семь",
"восемь", половина половины половине половину половиной половин половинам половинами половинах половиною
"девять",
"десять", один одного одному одним одном
"одиннадцать", первой первого первому первом первый первым первых
"двенадцать", во-первых
"тринадцать", единица единицы единице единицу единицей единиц единицам единицами единицах единицею
"четырнадцать",
"пятнадцать", два двумя двум двух двоих двое две
"шестнадцать", второго второму второй втором вторым вторых
"семнадцать", двойка двойки двойке двойку двойкой двоек двойкам двойками двойках двойкою
"восемнадцать", во-вторых
"девятнадцать", оба обе обеим обеими обеих обоим обоими обоих
"двадцать",
"тридцать", полтора полторы полутора
"сорок",
"пятьдесят", три третьего третьему третьем третьим третий тремя трем трех трое троих трёх
"шестьдесят", тройка тройки тройке тройку тройкою троек тройкам тройками тройках тройкой
"семьдесят", троечка троечки троечке троечку троечкой троечек троечкам троечками троечках троечкой
"восемьдесят", трешка трешки трешке трешку трешкой трешек трешкам трешками трешках трешкою
"девяносто", трёшка трёшки трёшке трёшку трёшкой трёшек трёшкам трёшками трёшках трёшкою
"сто", трояк трояка трояку трояком трояке трояки трояков троякам трояками трояках
"двести", треха треху трехой
"триста", трёха трёху трёхой
"четыреста", втроем втроём
"пятьсот",
"шестьсот", четыре четвертого четвертому четвертом четвертый четвертым четверка четырьмя четырем четырех четверо четырёх четверым
"семьсот", четверых
"восемьсот", вчетвером
"девятьсот",
"тысяча", пять пятого пятому пятом пятый пятым пятью пяти пятеро пятерых пятерыми
"миллион", впятером
"миллиард", пятерочка пятерочки пятерочке пятерочками пятерочкой пятерочку пятерочкой пятерочками
"триллион", пятёрочка пятёрочки пятёрочке пятёрочками пятёрочкой пятёрочку пятёрочкой пятёрочками
"квадриллион", пятерка пятерки пятерке пятерками пятеркой пятерку пятерками
"квинтиллион", пятёрка пятёрки пятёрке пятёрками пятёркой пятёрку пятёрками
] пятёра пятёры пятёре пятёрами пятёрой пятёру пятёрами
пятера пятеры пятере пятерами пятерой пятеру пятерами
пятак пятаки пятаке пятаками пятаком пятаку пятаками
шесть шестерка шестого шестому шестой шестом шестым шестью шести шестеро шестерых
вшестером
семь семерка седьмого седьмому седьмой седьмом седьмым семью семи семеро седьмых
всемером
восемь восьмерка восьмого восьмому восемью восьмой восьмом восьмым восеми восьмером восьми восьмью
восьмерых
ввосьмером
девять девятого девятому девятка девятом девятый девятым девятью девяти девятером вдевятером девятерых
вдевятером
десять десятого десятому десятка десятом десятый десятым десятью десяти десятером десятых
вдесятером
одиннадцать одиннадцатого одиннадцатому одиннадцатом одиннадцатый одиннадцатым одиннадцатью одиннадцати
одиннадцатых
двенадцать двенадцатого двенадцатому двенадцатом двенадцатый двенадцатым двенадцатью двенадцати
двенадцатых
тринадцать тринадцатого тринадцатому тринадцатом тринадцатый тринадцатым тринадцатью тринадцати
тринадцатых
четырнадцать четырнадцатого четырнадцатому четырнадцатом четырнадцатый четырнадцатым четырнадцатью четырнадцати
четырнадцатых
пятнадцать пятнадцатого пятнадцатому пятнадцатом пятнадцатый пятнадцатым пятнадцатью пятнадцати
пятнадцатых
пятнарик пятнарику пятнариком пятнарики
шестнадцать шестнадцатого шестнадцатому шестнадцатом шестнадцатый шестнадцатым шестнадцатью шестнадцати
шестнадцатых
семнадцать семнадцатого семнадцатому семнадцатом семнадцатый семнадцатым семнадцатью семнадцати семнадцатых
восемнадцать восемнадцатого восемнадцатому восемнадцатом восемнадцатый восемнадцатым восемнадцатью восемнадцати
восемнадцатых
девятнадцать девятнадцатого девятнадцатому девятнадцатом девятнадцатый девятнадцатым девятнадцатью девятнадцати
девятнадцатых
двадцать двадцатого двадцатому двадцатом двадцатый двадцатым двадцатью двадцати двадцатых
четвертак четвертака четвертаке четвертаку четвертаки четвертаком четвертаками
тридцать тридцатого тридцатому тридцатом тридцатый тридцатым тридцатью тридцати тридцатых
тридцадка тридцадку тридцадке тридцадки тридцадкой тридцадкою тридцадками
тридевять тридевяти тридевятью
сорок сорокового сороковому сороковом сороковым сороковой сороковых
сорокет сорокета сорокету сорокете сорокеты сорокетом сорокетами сорокетам
пятьдесят пятьдесятого пятьдесятому пятьюдесятью пятьдесятом пятьдесятый пятьдесятым пятидесяти пятьдесятых
полтинник полтинника полтиннике полтиннику полтинники полтинником полтинниками полтинникам полтинниках
пятидесятка пятидесятке пятидесятку пятидесятки пятидесяткой пятидесятками пятидесяткам пятидесятках
полтос полтоса полтосе полтосу полтосы полтосом полтосами полтосам полтосах
шестьдесят шестьдесятого шестьдесятому шестьюдесятью шестьдесятом шестьдесятый шестьдесятым шестидесятые шестидесяти
шестьдесятых
семьдесят семьдесятого семьдесятому семьюдесятью семьдесятом семьдесятый семьдесятым семидесяти семьдесятых
восемьдесят восемьдесятого восемьдесятому восемьюдесятью восемьдесятом восемьдесятый восемьдесятым восемидесяти
восьмидесяти восьмидесятых
девяносто девяностого девяностому девяностом девяностый девяностым девяноста девяностых
сто сотого сотому сотом сотен сотый сотым ста
стольник стольника стольнику стольнике стольники стольником стольниками
сотка сотки сотке соткой сотками соткам сотках
сотня сотни сотне сотней сотнями сотням сотнях
двести двумястами двухсотого двухсотому двухсотом двухсотый двухсотым двумстам двухстах двухсот
триста тремястами трехсотого трехсотому трехсотом трехсотый трехсотым тремстам трехстах трехсот
четыреста четырехсотого четырехсотому четырьмястами четырехсотом четырехсотый четырехсотым четыремстам четырехстах
четырехсот
пятьсот пятисотого пятисотому пятьюстами пятисотом пятисотый пятисотым пятистам пятистах пятисот
пятисотка пятисотки пятисотке пятисоткой пятисотками пятисоткам пятисоткою пятисотках
пятихатка пятихатки пятихатке пятихаткой пятихатками пятихаткам пятихаткою пятихатках
пятифан пятифаны пятифане пятифаном пятифанами пятифанах
шестьсот шестисотого шестисотому шестьюстами шестисотом шестисотый шестисотым шестистам шестистах шестисот
семьсот семисотого семисотому семьюстами семисотом семисотый семисотым семистам семистах семисот
восемьсот восемисотого восемисотому восемисотом восемисотый восемисотым восьмистами восьмистам восьмистах восьмисот
девятьсот девятисотого девятисотому девятьюстами девятисотом девятисотый девятисотым девятистам девятистах девятисот
тысяча тысячного тысячному тысячном тысячный тысячным тысячам тысячах тысячей тысяч тысячи тыс
косарь косаря косару косарем косарями косарях косарям косарей
десятитысячный десятитысячного десятитысячному десятитысячным десятитысячном десятитысячная десятитысячной
десятитысячную десятитысячною десятитысячное десятитысячные десятитысячных десятитысячными
двадцатитысячный двадцатитысячного двадцатитысячному двадцатитысячным двадцатитысячном двадцатитысячная
двадцатитысячной двадцатитысячную двадцатитысячною двадцатитысячное двадцатитысячные двадцатитысячных
двадцатитысячными
тридцатитысячный тридцатитысячного тридцатитысячному тридцатитысячным тридцатитысячном тридцатитысячная
тридцатитысячной тридцатитысячную тридцатитысячною тридцатитысячное тридцатитысячные тридцатитысячных
тридцатитысячными
сорокатысячный сорокатысячного сорокатысячному сорокатысячным сорокатысячном сорокатысячная
сорокатысячной сорокатысячную сорокатысячною сорокатысячное сорокатысячные сорокатысячных
сорокатысячными
пятидесятитысячный пятидесятитысячного пятидесятитысячному пятидесятитысячным пятидесятитысячном пятидесятитысячная
пятидесятитысячной пятидесятитысячную пятидесятитысячною пятидесятитысячное пятидесятитысячные пятидесятитысячных
пятидесятитысячными
шестидесятитысячный шестидесятитысячного шестидесятитысячному шестидесятитысячным шестидесятитысячном шестидесятитысячная
шестидесятитысячной шестидесятитысячную шестидесятитысячною шестидесятитысячное шестидесятитысячные шестидесятитысячных
шестидесятитысячными
семидесятитысячный семидесятитысячного семидесятитысячному семидесятитысячным семидесятитысячном семидесятитысячная
семидесятитысячной семидесятитысячную семидесятитысячною семидесятитысячное семидесятитысячные семидесятитысячных
семидесятитысячными
восьмидесятитысячный восьмидесятитысячного восьмидесятитысячному восьмидесятитысячным восьмидесятитысячном восьмидесятитысячная
восьмидесятитысячной восьмидесятитысячную восьмидесятитысячною восьмидесятитысячное восьмидесятитысячные восьмидесятитысячных
восьмидесятитысячными
стотысячный стотысячного стотысячному стотысячным стотысячном стотысячная стотысячной стотысячную стотысячное
стотысячные стотысячных стотысячными стотысячною
миллион миллионного миллионов миллионному миллионном миллионный миллионным миллионом миллиона миллионе миллиону
миллионов
лям ляма лямы лямом лямами лямах лямов
млн
десятимиллионная десятимиллионной десятимиллионными десятимиллионный десятимиллионным десятимиллионному
десятимиллионными десятимиллионную десятимиллионное десятимиллионные десятимиллионных десятимиллионною
миллиард миллиардного миллиардному миллиардном миллиардный миллиардным миллиардом миллиарда миллиарде миллиарду
миллиардов
лярд лярда лярды лярдом лярдами лярдах лярдов
млрд
триллион триллионного триллионному триллионном триллионный триллионным триллионом триллиона триллионе триллиону
триллионов трлн
квадриллион квадриллионного квадриллионному квадриллионный квадриллионным квадриллионом квадриллиона квадриллионе
квадриллиону квадриллионов квадрлн
квинтиллион квинтиллионного квинтиллионному квинтиллионный квинтиллионным квинтиллионом квинтиллиона квинтиллионе
квинтиллиону квинтиллионов квинтлн
i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi xvii xviii xix xx xxi xxii xxiii xxiv xxv xxvi xxvii xxviii xxix
""".split()
)
)
def like_num(text): def like_num(text):
if text.startswith(("+", "-", "±", "~")): if text.startswith(("+", "-", "±", "~")):
text = text[1:] text = text[1:]
if text.endswith("%"):
text = text[:-1]
text = text.replace(",", "").replace(".", "") text = text.replace(",", "").replace(".", "")
if text.isdigit(): if text.isdigit():
return True return True
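The percent handling added in this hunk can be tried in isolation. The following is a minimal sketch, not the module's full function: `NUM_WORDS` is a hypothetical three-word stand-in for the much larger `_num_words` set, and the final word lookup is an assumption about the part of `like_num` that the hunk does not show.

```python
# Hypothetical stand-in for the module's much larger _num_words set.
NUM_WORDS = {"миллион", "млн", "тыс"}


def like_num(text: str) -> bool:
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop a single trailing percent sign (the check added in this hunk).
    if text.endswith("%"):
        text = text[:-1]
    # Ignore thousands separators and decimal points.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Assumed fallback: look the lower-cased token up in the word list.
    return text.lower() in NUM_WORDS


print(like_num("~50%"), like_num("-1,000.5"), like_num("кот"))
```

With the new check, a token such as `10%` reduces to `10` before the digit test, while a bare `%` reduces to the empty string and is rejected.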

View File

@ -1,52 +1,111 @@
STOP_WORDS = set( STOP_WORDS = set(
""" """
а а авось ага агу аж ай али алло ау ах ая
будем будет будете будешь буду будут будучи будь будьте бы был была были было б будем будет будете будешь буду будут будучи будь будьте бы был была были было
быть быть бац без безусловно бишь благо благодаря ближайшие близко более больше
будто бывает бывала бывали бываю бывают бытует
в вам вами вас весь во вот все всё всего всей всем всём всеми всему всех всею в вам вами вас весь во вот все всё всего всей всем всём всеми всему всех всею
всея всю вся вы всея всю вся вы ваш ваша ваше ваши вдали вдобавок вдруг ведь везде вернее
взаимно взаправду видно вишь включая вместо внакладе вначале вне вниз внизу
вновь вовсе возможно воистину вокруг вон вообще вопреки вперекор вплоть
вполне вправду вправе впрочем впрямь вресноту вроде вряд всегда всюду
всякий всякого всякой всячески вчеред
да для до г го где гораздо гав
его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею д да для до дабы давайте давно давным даже далее далеко дальше данная
данного данное данной данном данному данные данный данных дану данунах
даром де действительно довольно доколе доколь долго должен должна
должно должны должный дополнительно другая другие другим другими
других другое другой
же е его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею едва
ежели еле
за ж же
и из или им ими имъ их з за затем зато зачем здесь значит зря
и из или им ими имъ их ибо иль имеет имел имела имело именно иметь иначе
иногда иным иными итак ишь
й
к как кем ко когда кого ком кому комья которая которого которое которой котором к как кем ко когда кого ком кому комья которая которого которое которой котором
которому которою которую которые который которым которыми которых кто которому которою которую которые который которым которыми которых кто ка кабы
каждая каждое каждые каждый кажется казалась казались казалось казался казаться
какая какие каким какими каков какого какой какому какою касательно кой коли
коль конечно короче кроме кстати ку куда
меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего л ли либо лишь любая любого любое любой любом любую любыми любых
м меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего
моей моем моём моему моею можем может можете можешь мои мой моим моими моих моей моем моём моему моею можем может можете можешь мои мой моим моими моих
мочь мою моя мы мочь мою моя мы мало меж между менее меньше мимо многие много многого многое
многом многому можно мол му
на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим н на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим
нашими наших нашу не него нее неё ней нем нём нему нет нею ним ними них но нашими наших нашу не него нее неё ней нем нём нему нет нею ним ними них но
наверняка наверху навряд навыворот над надо назад наиболее наизворот
наизнанку наипаче накануне наконец наоборот наперед наперекор наподобие
например напротив напрямую насилу настоящая настоящее настоящие настоящий
насчет нате находиться начала начале неважно негде недавно недалеко незачем
некем некогда некому некоторая некоторые некоторый некоторых некто некуда
нельзя немногие немногим немного необходимо необходимости необходимые
необходимым неоткуда непрерывно нередко несколько нету неужели нечего
нечем нечему нечто нешто нибудь нигде ниже низко никак никакой никем
никогда никого никому никто никуда ниоткуда нипочем ничего ничем ничему
ничто ну нужная нужно нужного нужные нужный нужных ныне нынешнее нынешней
нынешних нынче
о об один одна одни одним одними одних одно одного одной одном одному одною о об один одна одни одним одними одних одно одного одной одном одному одною
одну он она оне они оно от одну он она оне они оно от оба общую обычно ого однажды однако ой около оный
оп опять особенно особо особую особые откуда отнелижа отнелиже отовсюду
отсюда оттого оттот оттуда отчего отчему ох очевидно очень ом
по при п по при паче перед под подавно поди подобная подобно подобного подобные
подобный подобным подобных поелику пожалуй пожалуйста позже поистине
пока покамест поколе поколь покуда покудова помимо понеже поприще пор
пора посему поскольку после посреди посредством потом потому потомушта
похожем почему почти поэтому прежде притом причем про просто прочего
прочее прочему прочими проще прям пусть
р ради разве ранее рано раньше рядом
с сам сама сами самим самими самих само самого самом самому саму свое своё с сам сама сами самим самими самих само самого самом самому саму свое своё
своего своей своем своём своему своею свои свой своим своими своих свою своя своего своей своем своём своему своею свои свой своим своими своих свою своя
себе себя собой собою себе себя собой собою самая самое самой самый самых сверх свыше се сего сей
сейчас сие сих сквозь сколько скорее скоро следует слишком смогут сможет
сначала снова со собственно совсем сперва спокону спустя сразу среди сродни
стал стала стали стало стать суть сызнова
та так такая такие таким такими таких такого такое такой таком такому такою та то ту ты ти так такая такие таким такими таких такого такое такой таком такому такою
такую те тебе тебя тем теми тех то тобой тобою того той только том томах тому такую те тебе тебя тем теми тех тобой тобою того той только том томах тому
тот тою ту ты тот тою также таки таков такова там твои твоим твоих твой твоя твоё
теперь тогда тоже тотчас точно туда тут тьфу тая
у уже у уже увы уж ура ух ую
чего чем чём чему что чтобы ф фу
эта эти этим этими этих это этого этой этом этому этот этою эту х ха хе хорошо хотел хотела хотелось хотеть хоть хотя хочешь хочу хуже
я ч чего чем чём чему что чтобы часто чаще чей через чтоб чуть чхать чьим
чьих чьё чё
ш ша
щ ща щас
ы ых ые ый
э эта эти этим этими этих это этого этой этом этому этот этою эту эдак эдакий
эй эка экий этак этакий эх
ю
я явно явных яко якобы якоже
""".split() """.split()
) )
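The stop list is built by splitting one whitespace-separated block into words and collecting them in a set. A small sketch of that idiom, using a hypothetical three-line excerpt rather than the real list:

```python
# Hypothetical excerpt; the real list above is far longer.
STOP_WORDS = set(
    """
а авось ага
б будем будет
а будет
""".split()
)

# str.split() with no argument splits on any run of whitespace, including
# newlines, so the line layout of the block carries no meaning.
print(len(STOP_WORDS))
```

Because the result is a set, a word that appears on more than one line is stored once, so the grouping by initial letter is purely for readability.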

View File

@ -2,7 +2,6 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS
from ...symbols import ORTH, NORM from ...symbols import ORTH, NORM
from ...util import update_exc from ...util import update_exc
_exc = {} _exc = {}
_abbrev_exc = [ _abbrev_exc = [
@ -42,7 +41,6 @@ _abbrev_exc = [
{ORTH: "дек", NORM: "декабрь"}, {ORTH: "дек", NORM: "декабрь"},
] ]
for abbrev_desc in _abbrev_exc: for abbrev_desc in _abbrev_exc:
abbrev = abbrev_desc[ORTH] abbrev = abbrev_desc[ORTH]
for orth in (abbrev, abbrev.capitalize(), abbrev.upper()): for orth in (abbrev, abbrev.capitalize(), abbrev.upper()):
@ -50,17 +48,354 @@ for abbrev_desc in _abbrev_exc:
_exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}] _exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}]
_slang_exc = [ for abbr in [
# Year slang abbreviations
{ORTH: "2к15", NORM: "2015"}, {ORTH: "2к15", NORM: "2015"},
{ORTH: "2к16", NORM: "2016"}, {ORTH: "2к16", NORM: "2016"},
{ORTH: "2к17", NORM: "2017"}, {ORTH: "2к17", NORM: "2017"},
{ORTH: "2к18", NORM: "2018"}, {ORTH: "2к18", NORM: "2018"},
{ORTH: "2к19", NORM: "2019"}, {ORTH: "2к19", NORM: "2019"},
{ORTH: "2к20", NORM: "2020"}, {ORTH: "2к20", NORM: "2020"},
] {ORTH: "2к21", NORM: "2021"},
{ORTH: "2к22", NORM: "2022"},
{ORTH: "2к23", NORM: "2023"},
{ORTH: "2к24", NORM: "2024"},
{ORTH: "2к25", NORM: "2025"},
]:
_exc[abbr[ORTH]] = [abbr]
for slang_desc in _slang_exc: for abbr in [
_exc[slang_desc[ORTH]] = [slang_desc] # Profession and academic titles abbreviations
{ORTH: "ак.", NORM: "академик"},
{ORTH: "акад.", NORM: "академик"},
{ORTH: "д-р архитектуры", NORM: "доктор архитектуры"},
{ORTH: "д-р биол. наук", NORM: "доктор биологических наук"},
{ORTH: "д-р ветеринар. наук", NORM: "доктор ветеринарных наук"},
{ORTH: "д-р воен. наук", NORM: "доктор военных наук"},
{ORTH: "д-р геогр. наук", NORM: "доктор географических наук"},
{ORTH: "д-р геол.-минерал. наук", NORM: "доктор геолого-минералогических наук"},
{ORTH: "д-р искусствоведения", NORM: "доктор искусствоведения"},
{ORTH: "д-р ист. наук", NORM: "доктор исторических наук"},
{ORTH: "д-р культурологии", NORM: "доктор культурологии"},
{ORTH: "д-р мед. наук", NORM: "доктор медицинских наук"},
{ORTH: "д-р пед. наук", NORM: "доктор педагогических наук"},
{ORTH: "д-р полит. наук", NORM: "доктор политических наук"},
{ORTH: "д-р психол. наук", NORM: "доктор психологических наук"},
{ORTH: "д-р с.-х. наук", NORM: "доктор сельскохозяйственных наук"},
{ORTH: "д-р социол. наук", NORM: "доктор социологических наук"},
{ORTH: "д-р техн. наук", NORM: "доктор технических наук"},
{ORTH: "д-р фармацевт. наук", NORM: "доктор фармацевтических наук"},
{ORTH: "д-р физ.-мат. наук", NORM: "доктор физико-математических наук"},
{ORTH: "д-р филол. наук", NORM: "доктор филологических наук"},
{ORTH: "д-р филос. наук", NORM: "доктор философских наук"},
{ORTH: "д-р хим. наук", NORM: "доктор химических наук"},
{ORTH: "д-р экон. наук", NORM: "доктор экономических наук"},
{ORTH: "д-р юрид. наук", NORM: "доктор юридических наук"},
{ORTH: "д-р", NORM: "доктор"},
{ORTH: "д.б.н.", NORM: "доктор биологических наук"},
{ORTH: "д.г.-м.н.", NORM: "доктор геолого-минералогических наук"},
{ORTH: "д.г.н.", NORM: "доктор географических наук"},
{ORTH: "д.и.н.", NORM: "доктор исторических наук"},
{ORTH: "д.иск.", NORM: "доктор искусствоведения"},
{ORTH: "д.м.н.", NORM: "доктор медицинских наук"},
{ORTH: "д.п.н.", NORM: "доктор психологических наук"},
{ORTH: "д.пед.н.", NORM: "доктор педагогических наук"},
{ORTH: "д.полит.н.", NORM: "доктор политических наук"},
{ORTH: "д.с.-х.н.", NORM: "доктор сельскохозяйственных наук"},
{ORTH: "д.социол.н.", NORM: "доктор социологических наук"},
{ORTH: "д.т.н.", NORM: "доктор технических наук"},
{ORTH: "д.т.н", NORM: "доктор технических наук"},
{ORTH: "д.ф.-м.н.", NORM: "доктор физико-математических наук"},
{ORTH: "д.ф.н.", NORM: "доктор филологических наук"},
{ORTH: "д.филос.н.", NORM: "доктор философских наук"},
{ORTH: "д.фил.н.", NORM: "доктор филологических наук"},
{ORTH: "д.х.н.", NORM: "доктор химических наук"},
{ORTH: "д.э.н.", NORM: "доктор экономических наук"},
{ORTH: "д.э.н", NORM: "доктор экономических наук"},
{ORTH: "д.ю.н.", NORM: "доктор юридических наук"},
{ORTH: "доц.", NORM: "доцент"},
{ORTH: "и.о.", NORM: "исполняющий обязанности"},
{ORTH: "к.б.н.", NORM: "кандидат биологических наук"},
{ORTH: "к.воен.н.", NORM: "кандидат военных наук"},
{ORTH: "к.г.-м.н.", NORM: "кандидат геолого-минералогических наук"},
{ORTH: "к.г.н.", NORM: "кандидат географических наук"},
{ORTH: "к.геогр", NORM: "кандидат географических наук"},
{ORTH: "к.геогр.наук", NORM: "кандидат географических наук"},
{ORTH: "к.и.н.", NORM: "кандидат исторических наук"},
{ORTH: "к.иск.", NORM: "кандидат искусствоведения"},
{ORTH: "к.м.н.", NORM: "кандидат медицинских наук"},
{ORTH: "к.п.н.", NORM: "кандидат психологических наук"},
{ORTH: "к.псх.н.", NORM: "кандидат психологических наук"},
{ORTH: "к.пед.н.", NORM: "кандидат педагогических наук"},
{ORTH: "канд.пед.наук", NORM: "кандидат педагогических наук"},
{ORTH: "к.полит.н.", NORM: "кандидат политических наук"},
{ORTH: "к.с.-х.н.", NORM: "кандидат сельскохозяйственных наук"},
{ORTH: "к.социол.н.", NORM: "кандидат социологических наук"},
{ORTH: "к.с.н.", NORM: "кандидат социологических наук"},
{ORTH: "к.т.н.", NORM: "кандидат технических наук"},
{ORTH: "к.ф.-м.н.", NORM: "кандидат физико-математических наук"},
{ORTH: "к.ф.н.", NORM: "кандидат филологических наук"},
{ORTH: "к.фил.н.", NORM: "кандидат филологических наук"},
{ORTH: "к.филол.н", NORM: "кандидат филологических наук"},
{ORTH: "к.фарм.наук", NORM: "кандидат фармакологических наук"},
{ORTH: "к.фарм.н.", NORM: "кандидат фармакологических наук"},
{ORTH: "к.фарм.н", NORM: "кандидат фармакологических наук"},
{ORTH: "к.филос.наук", NORM: "кандидат философских наук"},
{ORTH: "к.филос.н.", NORM: "кандидат философских наук"},
{ORTH: "к.филос.н", NORM: "кандидат философских наук"},
{ORTH: "к.х.н.", NORM: "кандидат химических наук"},
{ORTH: "к.х", NORM: "кандидат химических наук"},
{ORTH: "к.э.н.", NORM: "кандидат экономических наук"},
{ORTH: "к.э.н", NORM: "кандидат экономических наук"},
{ORTH: "к.ю.н.", NORM: "кандидат юридических наук"},
{ORTH: "к.ю.н", NORM: "кандидат юридических наук"},
{ORTH: "канд. архитектуры", NORM: "кандидат архитектуры"},
{ORTH: "канд. биол. наук", NORM: "кандидат биологических наук"},
{ORTH: "канд. ветеринар. наук", NORM: "кандидат ветеринарных наук"},
{ORTH: "канд. воен. наук", NORM: "кандидат военных наук"},
{ORTH: "канд. геогр. наук", NORM: "кандидат географических наук"},
{ORTH: "канд. геол.-минерал. наук", NORM: "кандидат геолого-минералогических наук"},
{ORTH: "канд. искусствоведения", NORM: "кандидат искусствоведения"},
{ORTH: "канд. ист. наук", NORM: "кандидат исторических наук"},
{ORTH: "к.ист.н.", NORM: "кандидат исторических наук"},
{ORTH: "канд. культурологии", NORM: "кандидат культурологии"},
{ORTH: "канд. мед. наук", NORM: "кандидат медицинских наук"},
{ORTH: "канд. пед. наук", NORM: "кандидат педагогических наук"},
{ORTH: "канд. полит. наук", NORM: "кандидат политических наук"},
{ORTH: "канд. психол. наук", NORM: "кандидат психологических наук"},
{ORTH: "канд. с.-х. наук", NORM: "кандидат сельскохозяйственных наук"},
{ORTH: "канд. социол. наук", NORM: "кандидат социологических наук"},
{ORTH: "к.соц.наук", NORM: "кандидат социологических наук"},
{ORTH: "к.соц.н.", NORM: "кандидат социологических наук"},
{ORTH: "к.соц.н", NORM: "кандидат социологических наук"},
{ORTH: "канд. техн. наук", NORM: "кандидат технических наук"},
{ORTH: "канд. фармацевт. наук", NORM: "кандидат фармацевтических наук"},
{ORTH: "канд. физ.-мат. наук", NORM: "кандидат физико-математических наук"},
{ORTH: "канд. филол. наук", NORM: "кандидат филологических наук"},
{ORTH: "канд. филос. наук", NORM: "кандидат философских наук"},
{ORTH: "канд. хим. наук", NORM: "кандидат химических наук"},
{ORTH: "канд. экон. наук", NORM: "кандидат экономических наук"},
{ORTH: "канд. юрид. наук", NORM: "кандидат юридических наук"},
{ORTH: "в.н.с.", NORM: "ведущий научный сотрудник"},
{ORTH: "мл. науч. сотр.", NORM: "младший научный сотрудник"},
{ORTH: "м.н.с.", NORM: "младший научный сотрудник"},
{ORTH: "проф.", NORM: "профессор"},
{ORTH: "профессор.кафедры", NORM: "профессор кафедры"},
{ORTH: "ст. науч. сотр.", NORM: "старший научный сотрудник"},
{ORTH: "чл.-к.", NORM: "член-корреспондент"},
{ORTH: "чл.-корр.", NORM: "член-корреспондент"},
{ORTH: "чл.-кор.", NORM: "член-корреспондент"},
{ORTH: "дир.", NORM: "директор"},
{ORTH: "зам. дир.", NORM: "заместитель директора"},
{ORTH: "зав. каф.", NORM: "заведующий кафедрой"},
{ORTH: "зав.кафедрой", NORM: "заведующий кафедрой"},
{ORTH: "зав. кафедрой", NORM: "заведующий кафедрой"},
{ORTH: "асп.", NORM: "аспирант"},
{ORTH: "гл. науч. сотр.", NORM: "главный научный сотрудник"},
{ORTH: "вед. науч. сотр.", NORM: "ведущий научный сотрудник"},
{ORTH: "науч. сотр.", NORM: "научный сотрудник"},
{ORTH: "к.м.с.", NORM: "кандидат в мастера спорта"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
# Literary phrases abbreviations
{ORTH: "и т.д.", NORM: "и так далее"},
{ORTH: "и т.п.", NORM: "и тому подобное"},
{ORTH: "т.д.", NORM: "так далее"},
{ORTH: "т.п.", NORM: "тому подобное"},
{ORTH: "т.е.", NORM: "то есть"},
{ORTH: "т.к.", NORM: "так как"},
{ORTH: "в т.ч.", NORM: "в том числе"},
{ORTH: "и пр.", NORM: "и прочие"},
{ORTH: "и др.", NORM: "и другие"},
{ORTH: "т.н.", NORM: "так называемый"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
# Forms of address abbreviations
{ORTH: "г", NORM: "господин"},
{ORTH: "г-да", NORM: "господа"},
{ORTH: "г-жа", NORM: "госпожа"},
{ORTH: "тов.", NORM: "товарищ"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
# Time periods abbreviations
{ORTH: "до н.э.", NORM: "до нашей эры"},
{ORTH: "по н.в.", NORM: "по настоящее время"},
{ORTH: "в н.в.", NORM: "в настоящее время"},
{ORTH: "наст.", NORM: "настоящий"},
{ORTH: "наст. время", NORM: "настоящее время"},
{ORTH: "г.г.", NORM: "годы"},
{ORTH: "гг.", NORM: "годы"},
{ORTH: "т.г.", NORM: "текущий год"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
# Address forming elements abbreviations
{ORTH: "респ.", NORM: "республика"},
{ORTH: "обл.", NORM: "область"},
{ORTH: "г.ф.з.", NORM: "город федерального значения"},
{ORTH: "а.обл.", NORM: "автономная область"},
{ORTH: "а.окр.", NORM: "автономный округ"},
{ORTH: "м.р", NORM: "муниципальный район"},
{ORTH: "г.о.", NORM: "городской округ"},
{ORTH: "г.п.", NORM: "городское поселение"},
{ORTH: "с.п.", NORM: "сельское поселение"},
{ORTH: "вн.р", NORM: "внутригородской район"},
{ORTH: "вн.тер.г.", NORM: "внутригородская территория города"},
{ORTH: "пос.", NORM: "поселение"},
{ORTH: "р", NORM: "район"},
{ORTH: "с/с", NORM: "сельсовет"},
{ORTH: "г.", NORM: "город"},
{ORTH: "п.г.т.", NORM: "поселок городского типа"},
{ORTH: "пгт.", NORM: "поселок городского типа"},
{ORTH: "р.п.", NORM: "рабочий поселок"},
{ORTH: "рп.", NORM: "рабочий поселок"},
{ORTH: "кп.", NORM: "курортный поселок"},
{ORTH: "гп.", NORM: "городской поселок"},
{ORTH: "п.", NORM: "поселок"},
{ORTH: "в-ки", NORM: "выселки"},
{ORTH: "г", NORM: "городок"},
{ORTH: "з-ка", NORM: "заимка"},
{ORTH: "п-к", NORM: "починок"},
{ORTH: "киш.", NORM: "кишлак"},
{ORTH: "п. ст. ", NORM: "поселок станция"},
{ORTH: "п. ж/д ст. ", NORM: "поселок при железнодорожной станции"},
{ORTH: "ж/д бл-ст", NORM: "железнодорожный блокпост"},
{ORTH: "ж/д б-ка", NORM: "железнодорожная будка"},
{ORTH: "ж/д в-ка", NORM: "железнодорожная ветка"},
{ORTH: "ж/д к-ма", NORM: "железнодорожная казарма"},
{ORTH: "ж/д к-т", NORM: "железнодорожный комбинат"},
{ORTH: "ж/д пл-ма", NORM: "железнодорожная платформа"},
{ORTH: "ж/д пл-ка", NORM: "железнодорожная площадка"},
{ORTH: "ж/д п.п.", NORM: "железнодорожный путевой пост"},
{ORTH: "ж/д о.п.", NORM: "железнодорожный остановочный пункт"},
{ORTH: "ж/д рзд.", NORM: "железнодорожный разъезд"},
{ORTH: "ж/д ст. ", NORM: "железнодорожная станция"},
{ORTH: "м-ко", NORM: "местечко"},
{ORTH: "д.", NORM: "деревня"},
{ORTH: "с.", NORM: "село"},
{ORTH: "сл.", NORM: "слобода"},
{ORTH: "ст. ", NORM: "станция"},
{ORTH: "ст-ца", NORM: "станица"},
{ORTH: "у.", NORM: "улус"},
{ORTH: "х.", NORM: "хутор"},
{ORTH: "рзд.", NORM: "разъезд"},
{ORTH: "зим.", NORM: "зимовье"},
{ORTH: "б-г", NORM: "берег"},
{ORTH: "ж/р", NORM: "жилой район"},
{ORTH: "кв-л", NORM: "квартал"},
{ORTH: "мкр.", NORM: "микрорайон"},
{ORTH: "ост-в", NORM: "остров"},
{ORTH: "платф.", NORM: "платформа"},
{ORTH: "п/р", NORM: "промышленный район"},
{ORTH: "р", NORM: "район"},
{ORTH: "тер.", NORM: "территория"},
{
ORTH: "тер. СНО",
NORM: "территория садоводческих некоммерческих объединений граждан",
},
{
ORTH: "тер. ОНО",
NORM: "территория огороднических некоммерческих объединений граждан",
},
{ORTH: "тер. ДНО", NORM: "территория дачных некоммерческих объединений граждан"},
{ORTH: "тер. СНТ", NORM: "территория садоводческих некоммерческих товариществ"},
{ORTH: "тер. ОНТ", NORM: "территория огороднических некоммерческих товариществ"},
{ORTH: "тер. ДНТ", NORM: "территория дачных некоммерческих товариществ"},
{ORTH: "тер. СПК", NORM: "территория садоводческих потребительских кооперативов"},
{ORTH: "тер. ОПК", NORM: "территория огороднических потребительских кооперативов"},
{ORTH: "тер. ДПК", NORM: "территория дачных потребительских кооперативов"},
{ORTH: "тер. СНП", NORM: "территория садоводческих некоммерческих партнерств"},
{ORTH: "тер. ОНП", NORM: "территория огороднических некоммерческих партнерств"},
{ORTH: "тер. ДНП", NORM: "территория дачных некоммерческих партнерств"},
{ORTH: "тер. ТСН", NORM: "территория товарищества собственников недвижимости"},
{ORTH: "тер. ГСК", NORM: "территория гаражно-строительного кооператива"},
{ORTH: "ус.", NORM: "усадьба"},
{ORTH: "тер.ф.х.", NORM: "территория фермерского хозяйства"},
{ORTH: "ю.", NORM: "юрты"},
{ORTH: "ал.", NORM: "аллея"},
{ORTH: "б-р", NORM: "бульвар"},
{ORTH: "взв.", NORM: "взвоз"},
{ORTH: "взд.", NORM: "въезд"},
{ORTH: "дор.", NORM: "дорога"},
{ORTH: "ззд.", NORM: "заезд"},
{ORTH: "км", NORM: "километр"},
{ORTH: "к-цо", NORM: "кольцо"},
{ORTH: "лн.", NORM: "линия"},
{ORTH: "мгстр.", NORM: "магистраль"},
{ORTH: "наб.", NORM: "набережная"},
{ORTH: "пер-д", NORM: "переезд"},
{ORTH: "пер.", NORM: "переулок"},
{ORTH: "пл-ка", NORM: "площадка"},
{ORTH: "пл.", NORM: "площадь"},
{ORTH: "пр-д", NORM: "проезд"},
{ORTH: "пр-к", NORM: "просек"},
{ORTH: "пр-ка", NORM: "просека"},
{ORTH: "пр-лок", NORM: "проселок"},
{ORTH: "пр-кт", NORM: "проспект"},
{ORTH: "проул.", NORM: "проулок"},
{ORTH: "рзд.", NORM: "разъезд"},
{ORTH: "ряд", NORM: "ряд(ы)"},
{ORTH: "с-р", NORM: "сквер"},
{ORTH: "с", NORM: "спуск"},
{ORTH: "сзд.", NORM: "съезд"},
{ORTH: "туп.", NORM: "тупик"},
{ORTH: "ул.", NORM: "улица"},
{ORTH: "ш.", NORM: "шоссе"},
{ORTH: "влд.", NORM: "владение"},
{ORTH: "г", NORM: "гараж"},
{ORTH: "д.", NORM: "дом"},
{ORTH: "двлд.", NORM: "домовладение"},
{ORTH: "зд.", NORM: "здание"},
{ORTH: "з/у", NORM: "земельный участок"},
{ORTH: "кв.", NORM: "квартира"},
{ORTH: "ком.", NORM: "комната"},
{ORTH: "подв.", NORM: "подвал"},
{ORTH: "кот.", NORM: "котельная"},
{ORTH: "п-б", NORM: "погреб"},
{ORTH: "к.", NORM: "корпус"},
{ORTH: "ОНС", NORM: "объект незавершенного строительства"},
{ORTH: "оф.", NORM: "офис"},
{ORTH: "пав.", NORM: "павильон"},
{ORTH: "помещ.", NORM: "помещение"},
{ORTH: "раб.уч.", NORM: "рабочий участок"},
{ORTH: "скл.", NORM: "склад"},
{ORTH: "соор.", NORM: "сооружение"},
{ORTH: "стр.", NORM: "строение"},
{ORTH: "торг.зал", NORM: "торговый зал"},
{ORTH: "а/п", NORM: "аэропорт"},
{ORTH: "им.", NORM: "имени"},
]:
_exc[abbr[ORTH]] = [abbr]
for abbr in [
# Other abbreviations
{ORTH: "тыс.руб.", NORM: "тысяч рублей"},
{ORTH: "тыс.", NORM: "тысяч"},
{ORTH: "руб.", NORM: "рубль"},
{ORTH: "долл.", NORM: "доллар"},
{ORTH: "прим.", NORM: "примечание"},
{ORTH: "прим.ред.", NORM: "примечание редакции"},
{ORTH: "см. также", NORM: "смотри также"},
{ORTH: "кв.м.", NORM: "квадратный метр"},
{ORTH: "м2", NORM: "квадратный метр"},
{ORTH: "б/у", NORM: "бывший в употреблении"},
{ORTH: "сокр.", NORM: "сокращение"},
{ORTH: "чел.", NORM: "человек"},
{ORTH: "б.п.", NORM: "базисный пункт"},
]:
_exc[abbr[ORTH]] = [abbr]
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc) TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)

18
spacy/lang/sl/examples.py Normal file
View File

@ -0,0 +1,18 @@
"""
Example sentences to test spaCy and its language models.
>>> from spacy.lang.sl.examples import sentences
>>> docs = nlp.pipe(sentences)
"""
sentences = [
"Apple načrtuje nakup britanskega startupa za 1 bilijon dolarjev",
"France Prešeren je umrl 8. februarja 1849 v Kranju",
"Staro ljubljansko letališče Moste bo obnovila družba BTC",
"London je največje mesto v Združenem kraljestvu.",
"Kje se skrivaš?",
"Kdo je predsednik Francije?",
"Katero je glavno mesto Združenih držav Amerike?",
"Kdaj je bil rojen Milan Kučan?",
]

View File

@ -53,7 +53,7 @@ _ordinal_words = [
"doksanıncı", "doksanıncı",
"yüzüncü", "yüzüncü",
"bininci", "bininci",
"mliyonuncu", "milyonuncu",
"milyarıncı", "milyarıncı",
"trilyonuncu", "trilyonuncu",
"katrilyonuncu", "katrilyonuncu",

View File

@ -2,22 +2,29 @@ from ...attrs import LIKE_NUM
_num_words = [ _num_words = [
"không", "không", # Zero
"một", "một", # One
"hai", "mốt", # Also one, irreplaceable in niche cases for the unit digit, such as "51"="năm mươi mốt"
"ba", "hai", # Two
"bốn", "ba", # Three
"năm", "bốn", # Four
"sáu", "tư", # Also four, used in certain cases for the unit digit, such as "54"="năm mươi tư"
"bảy", "năm", # Five
"bẩy", "lăm", # Also five, irreplaceable in niche cases for the unit digit, such as "55"="năm mươi lăm"
"tám", "sáu", # Six
"chín", "bảy", # Seven
"mười", "bẩy", # Also seven, old-fashioned
"chục", "tám", # Eight
"trăm", "chín", # Nine
"nghìn", "mười", # Ten
"tỷ", "chục", # Also ten, used for counting in tens, such as "20 eggs"="hai chục trứng"
"trăm", # Hundred
"nghìn", # Thousand
"ngàn", # Also thousand, used in the south
"vạn", # Ten thousand
"triệu", # Million
"tỷ", # Billion
"tỉ", # Also billion, used in compounds such as "tỉ_phú"="billionaire"
] ]
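To see what the added variants change, each syllable of a spelled-out numeral can be checked against the list. This is a rough sketch under two assumptions: `NUM_WORDS_SUBSET` is only a hand-picked subset of the updated list, and splitting on whitespace stands in for real tokenization.

```python
# Hand-picked subset of the updated list (assumption: not the full list).
NUM_WORDS_SUBSET = ["một", "mốt", "hai", "năm", "lăm", "mười", "chục", "nghìn", "ngàn"]


def num_word_flags(phrase):
    # Whitespace splitting is a simplification of real tokenization.
    return [(tok, tok.lower() in NUM_WORDS_SUBSET) for tok in phrase.split()]


print(num_word_flags("năm mươi lăm"))
print(num_word_flags("hai ngàn"))
```

In "năm mươi lăm" (55), the unit-digit form "lăm" is now recognized alongside "năm". The tens marker "mươi" is spelled differently from "mười" and does not appear in the list shown in the hunk, so it is still reported as not number-like.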

View File

@ -131,7 +131,7 @@ class Language:
self, self,
vocab: Union[Vocab, bool] = True, vocab: Union[Vocab, bool] = True,
*, *,
max_length: int = 10 ** 6, max_length: int = 10**6,
meta: Dict[str, Any] = {}, meta: Dict[str, Any] = {},
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None, create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
batch_size: int = 1000, batch_size: int = 1000,
@ -1222,8 +1222,9 @@ class Language:
component_cfg = {} component_cfg = {}
grads = {} grads = {}
def get_grads(W, dW, key=None): def get_grads(key, W, dW):
grads[key] = (W, dW) grads[key] = (W, dW)
return W, dW
get_grads.learn_rate = sgd.learn_rate # type: ignore[attr-defined, union-attr] get_grads.learn_rate = sgd.learn_rate # type: ignore[attr-defined, union-attr]
get_grads.b1 = sgd.b1 # type: ignore[attr-defined, union-attr] get_grads.b1 = sgd.b1 # type: ignore[attr-defined, union-attr]
@ -1236,7 +1237,7 @@ class Language:
examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {}) examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {})
) )
for key, (W, dW) in grads.items(): for key, (W, dW) in grads.items():
sgd(W, dW, key=key) # type: ignore[call-arg, misc] sgd(key, W, dW) # type: ignore[call-arg, misc]
return losses return losses
def begin_training( def begin_training(
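The reordered callback can be illustrated with a toy optimizer. This is a sketch only: `toy_sgd` is a hypothetical stand-in that does plain gradient descent on Python lists, and it assumes, as the new `get_grads` suggests, that an optimizer is called as `(key, W, dW)` and returns the weights and gradient as a pair.

```python
grads = {}


def get_grads(key, W, dW):
    # Record the weights and gradient instead of applying an update,
    # and hand both back so callers expecting a return value still work.
    grads[key] = (W, dW)
    return W, dW


def toy_sgd(key, W, dW, learn_rate=0.1):
    # Hypothetical stand-in optimizer: plain gradient descent on lists.
    new_W = [w - learn_rate * g for w, g in zip(W, dW)]
    return new_W, [0.0 for _ in dW]


# A component would call the callback once per parameter while updating.
get_grads((1, "W"), [1.0, 2.0], [0.5, -0.5])
get_grads((2, "b"), [0.0], [1.0])

# Afterwards the collected gradients are replayed through the real optimizer.
updated = {}
for key, (W, dW) in grads.items():
    updated[key], _ = toy_sgd(key, W, dW)

print(updated)
```

Collecting first and applying later means every component sees the same pre-update weights, and the optimizer is invoked exactly once per parameter key.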

View File

@ -244,6 +244,10 @@ cdef class Matcher:
pipe = "parser" pipe = "parser"
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr)) error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
raise ValueError(error_msg) raise ValueError(error_msg)
if self.patterns.empty():
matches = []
else:
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length, matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments) extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
final_matches = [] final_matches = []
@ -686,18 +690,14 @@ cdef int8_t get_is_match(PatternStateC state,
return True return True
cdef int8_t get_is_final(PatternStateC state) nogil: cdef inline int8_t get_is_final(PatternStateC state) nogil:
if state.pattern[1].quantifier == FINAL_ID: if state.pattern[1].quantifier == FINAL_ID:
id_attr = state.pattern[1].attrs[0]
if id_attr.attr != ID:
with gil:
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
return 1 return 1
else: else:
return 0 return 0
cdef int8_t get_quantifier(PatternStateC state) nogil: cdef inline int8_t get_quantifier(PatternStateC state) nogil:
return state.pattern.quantifier return state.pattern.quantifier

View File

@ -14,7 +14,7 @@ class PhraseMatcher:
def add( def add(
self, self,
key: str, key: str,
docs: List[List[Dict[str, Any]]], docs: List[Doc],
*, *,
on_match: Optional[ on_match: Optional[
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any] Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]

View File

@ -63,4 +63,4 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]: def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
return (Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths)) return Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths)

View File

@ -1,34 +1,82 @@
from pathlib import Path from pathlib import Path
from typing import Optional, Callable, Iterable, List from typing import Optional, Callable, Iterable, List, Tuple
from thinc.types import Floats2d from thinc.types import Floats2d
from thinc.api import chain, clone, list2ragged, reduce_mean, residual from thinc.api import chain, clone, list2ragged, reduce_mean, residual
from thinc.api import Model, Maxout, Linear from thinc.api import Model, Maxout, Linear, noop, tuplify, Ragged
from ...util import registry from ...util import registry
from ...kb import KnowledgeBase, Candidate, get_candidates from ...kb import KnowledgeBase, Candidate, get_candidates
from ...vocab import Vocab from ...vocab import Vocab
from ...tokens import Span, Doc from ...tokens import Span, Doc
from ..extract_spans import extract_spans
from ...errors import Errors
@registry.architectures("spacy.EntityLinker.v1") @registry.architectures("spacy.EntityLinker.v2")
def build_nel_encoder( def build_nel_encoder(
tok2vec: Model, nO: Optional[int] = None tok2vec: Model, nO: Optional[int] = None
) -> Model[List[Doc], Floats2d]: ) -> Model[List[Doc], Floats2d]:
with Model.define_operators({">>": chain, "**": clone}): with Model.define_operators({">>": chain, "&": tuplify}):
token_width = tok2vec.maybe_get_dim("nO") token_width = tok2vec.maybe_get_dim("nO")
output_layer = Linear(nO=nO, nI=token_width) output_layer = Linear(nO=nO, nI=token_width)
model = ( model = (
tok2vec ((tok2vec >> list2ragged()) & build_span_maker())
>> list2ragged() >> extract_spans()
>> reduce_mean() >> reduce_mean()
>> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) # type: ignore[arg-type] >> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) # type: ignore[arg-type]
>> output_layer >> output_layer
) )
model.set_ref("output_layer", output_layer) model.set_ref("output_layer", output_layer)
model.set_ref("tok2vec", tok2vec) model.set_ref("tok2vec", tok2vec)
# flag to show this isn't legacy
model.attrs["include_span_maker"] = True
return model return model
def build_span_maker(n_sents: int = 0) -> Model:
model: Model = Model("span_maker", forward=span_maker_forward)
model.attrs["n_sents"] = n_sents
return model
def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callable]:
ops = model.ops
n_sents = model.attrs["n_sents"]
candidates = []
for doc in docs:
cands = []
try:
sentences = [s for s in doc.sents]
except ValueError:
# no sentence info, normal in initialization
for tok in doc:
tok.is_sent_start = tok.i == 0
sentences = [doc[:]]
for ent in doc.ents:
try:
# find the sentence in the list of sentences.
sent_index = sentences.index(ent.sent)
except AttributeError:
# Catch the exception when ent.sent is None and provide a user-friendly warning
raise RuntimeError(Errors.E030) from None
# get n previous sentences, if there are any
start_sentence = max(0, sent_index - n_sents)
# get n posterior sentences, or as many < n as there are
end_sentence = min(len(sentences) - 1, sent_index + n_sents)
# get token positions
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
# save positions for extraction
cands.append((start_token, end_token))
candidates.append(ops.asarray2i(cands))
candlens = ops.asarray1i([len(cands) for cands in candidates])
candidates = ops.xp.concatenate(candidates)
outputs = Ragged(candidates, candlens)
# because this is just rearranging docs, the backprop does nothing
return outputs, lambda x: []
@registry.misc("spacy.KBFromFile.v1") @registry.misc("spacy.KBFromFile.v1")
def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]: def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
def kb_from_file(vocab): def kb_from_file(vocab):
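
The window arithmetic in `span_maker_forward` can be sketched in isolation. This is a minimal illustration, not the spaCy implementation: `context_window` is a hypothetical helper, and sentences are modelled as plain `(start, end)` token offsets (end exclusive) instead of `Span` objects.

```python
from typing import List, Tuple


def context_window(
    sentences: List[Tuple[int, int]], sent_index: int, n_sents: int
) -> Tuple[int, int]:
    """Return the (start_token, end_token) window around one sentence."""
    # Clamp to the first sentence when fewer than n_sents precede it.
    start_sentence = max(0, sent_index - n_sents)
    # Clamp to the last sentence when fewer than n_sents follow it.
    end_sentence = min(len(sentences) - 1, sent_index + n_sents)
    return sentences[start_sentence][0], sentences[end_sentence][1]


sentences = [(0, 4), (4, 9), (9, 15)]
print(context_window(sentences, 1, 0))  # (4, 9)
print(context_window(sentences, 1, 1))  # (0, 15)
print(context_window(sentences, 0, 2))  # (0, 15)
```

With `n_sents = 0` the window is exactly the sentence that contains the entity, which matches the default of `build_span_maker`.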


@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f") target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
target = target.reshape((-1, 256 * nr_char)) target = target.reshape((-1, 256 * nr_char))
diff = prediction - target diff = prediction - target
loss = (diff ** 2).sum() loss = (diff**2).sum()
d_target = diff / float(prediction.shape[0]) d_target = diff / float(prediction.shape[0])
return loss, d_target return loss, d_target


@ -1,14 +1,14 @@
from typing import Optional, List from typing import Optional, List
from thinc.api import zero_init, with_array, Softmax, chain, Model from thinc.api import zero_init, with_array, Softmax_v2, chain, Model
from thinc.types import Floats2d from thinc.types import Floats2d
from ...util import registry from ...util import registry
from ...tokens import Doc from ...tokens import Doc
@registry.architectures("spacy.Tagger.v1") @registry.architectures("spacy.Tagger.v2")
def build_tagger_model( def build_tagger_model(
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None, normalize=False
) -> Model[List[Doc], List[Floats2d]]: ) -> Model[List[Doc], List[Floats2d]]:
"""Build a tagger model, using a provided token-to-vector component. The tagger """Build a tagger model, using a provided token-to-vector component. The tagger
model simply adds a linear layer with softmax activation to predict scores model simply adds a linear layer with softmax activation to predict scores
@ -19,7 +19,9 @@ def build_tagger_model(
""" """
# TODO: glorot_uniform_init seems to work a bit better than zero_init here?! # TODO: glorot_uniform_init seems to work a bit better than zero_init here?!
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
output_layer = Softmax(nO, t2v_width, init_W=zero_init) output_layer = Softmax_v2(
nO, t2v_width, init_W=zero_init, normalize_outputs=normalize
)
softmax = with_array(output_layer) # type: ignore softmax = with_array(output_layer) # type: ignore
model = chain(tok2vec, softmax) model = chain(tok2vec, softmax)
model.set_ref("tok2vec", tok2vec) model.set_ref("tok2vec", tok2vec)
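
The new `normalize` flag is passed on as `normalize_outputs` to `Softmax_v2`. Assuming that a false value means the layer may skip the softmax normalization at inference time, the predicted tag is unaffected, because softmax is monotonic and therefore preserves the arg max. The following pure-Python sketch (no Thinc involved; `softmax` and `argmax` are illustrative helpers) shows that property:

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


logits = [1.5, -0.3, 4.2, 0.0]
probs = softmax(logits)
# The best class is the same with and without normalization.
print(argmax(logits), argmax(probs))  # 2 2
```

Only consumers that need calibrated probabilities, rather than the best label, would need the normalized output.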


@ -11,6 +11,7 @@ import numpy.random
from thinc.api import Model, CupyOps, NumpyOps from thinc.api import Model, CupyOps, NumpyOps
from .. import util from .. import util
from ..errors import Errors
from ..typedefs cimport weight_t, class_t, hash_t from ..typedefs cimport weight_t, class_t, hash_t
from ..pipeline._parser_internals.stateclass cimport StateClass from ..pipeline._parser_internals.stateclass cimport StateClass
@ -411,7 +412,7 @@ cdef class precompute_hiddens:
elif name == "nO": elif name == "nO":
return self.nO return self.nO
else: else:
raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") raise ValueError(Errors.E1033.format(name=name))
def set_dim(self, name, value): def set_dim(self, name, value):
if name == "nF": if name == "nF":
@ -421,7 +422,7 @@ cdef class precompute_hiddens:
elif name == "nO": elif name == "nO":
self.nO = value self.nO = value
else: else:
raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP") raise ValueError(Errors.E1033.format(name=name))
def __call__(self, X, bint is_train): def __call__(self, X, bint is_train):
if is_train: if is_train:


@ -1,5 +1,6 @@
from .attributeruler import AttributeRuler from .attributeruler import AttributeRuler
from .dep_parser import DependencyParser from .dep_parser import DependencyParser
from .edit_tree_lemmatizer import EditTreeLemmatizer
from .entity_linker import EntityLinker from .entity_linker import EntityLinker
from .ner import EntityRecognizer from .ner import EntityRecognizer
from .entityruler import EntityRuler from .entityruler import EntityRuler


@ -0,0 +1,93 @@
from libc.stdint cimport uint32_t, uint64_t
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector
from ...typedefs cimport attr_t, hash_t, len_t
from ...strings cimport StringStore
cdef extern from "<algorithm>" namespace "std" nogil:
void swap[T](T& a, T& b) except + # Only available in Cython 3.
# An edit tree (Müller et al., 2015) is a tree structure that consists of
# edit operations. The two types of operations are string matches
# and string substitutions. Given an input string s and an output string t,
# substitution and match nodes should be interpreted as follows:
#
# * Substitution node: consists of an original string and substitute string.
# If s matches the original string, then t is the substitute. Otherwise,
# the node does not apply.
# * Match node: consists of a prefix length, suffix length, prefix edit tree,
# and suffix edit tree. If s is composed of a prefix, middle part, and suffix
# with the given suffix and prefix lengths, then t is the concatenation
# prefix_tree(prefix) + middle + suffix_tree(suffix).
#
# For efficiency, we represent strings in substitution nodes as integers, with
# the actual strings stored in a StringStore. Subtrees in match nodes are stored
# as tree identifiers (rather than pointers) to simplify serialization.
cdef uint32_t NULL_TREE_ID
cdef struct MatchNodeC:
len_t prefix_len
len_t suffix_len
uint32_t prefix_tree
uint32_t suffix_tree
cdef struct SubstNodeC:
attr_t orig
attr_t subst
cdef union NodeC:
MatchNodeC match_node
SubstNodeC subst_node
cdef struct EditTreeC:
bint is_match_node
NodeC inner
cdef inline EditTreeC edittree_new_match(len_t prefix_len, len_t suffix_len,
uint32_t prefix_tree, uint32_t suffix_tree):
cdef MatchNodeC match_node = MatchNodeC(prefix_len=prefix_len,
suffix_len=suffix_len, prefix_tree=prefix_tree,
suffix_tree=suffix_tree)
cdef NodeC inner = NodeC(match_node=match_node)
return EditTreeC(is_match_node=True, inner=inner)
cdef inline EditTreeC edittree_new_subst(attr_t orig, attr_t subst):
cdef EditTreeC node
cdef SubstNodeC subst_node = SubstNodeC(orig=orig, subst=subst)
cdef NodeC inner = NodeC(subst_node=subst_node)
return EditTreeC(is_match_node=False, inner=inner)
cdef inline uint64_t edittree_hash(EditTreeC tree):
cdef MatchNodeC match_node
cdef SubstNodeC subst_node
if tree.is_match_node:
match_node = tree.inner.match_node
return hash((match_node.prefix_len, match_node.suffix_len, match_node.prefix_tree, match_node.suffix_tree))
else:
subst_node = tree.inner.subst_node
return hash((subst_node.orig, subst_node.subst))
cdef struct LCS:
int source_begin
int source_end
int target_begin
int target_end
cdef inline bint lcs_is_empty(LCS lcs):
return lcs.source_begin == 0 and lcs.source_end == 0 and lcs.target_begin == 0 and lcs.target_end == 0
cdef class EditTrees:
cdef vector[EditTreeC] trees
cdef unordered_map[hash_t, uint32_t] map
cdef StringStore strings
cpdef uint32_t add(self, str form, str lemma)
cpdef str apply(self, uint32_t tree_id, str form)
cpdef unicode tree_to_str(self, uint32_t tree_id)
cdef uint32_t _add(self, str form, str lemma)
cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces)
cdef uint32_t _tree_id(self, EditTreeC tree)
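
The node semantics described in the comment block above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the Cython implementation: the `Subst`, `Match` and `apply_tree` names are hypothetical, subtrees are held as object references instead of tree identifiers, and strings are stored directly instead of in a `StringStore`.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Subst:
    orig: str   # string the input must equal
    subst: str  # string emitted in its place


@dataclass(frozen=True)
class Match:
    prefix_len: int
    suffix_len: int
    prefix_tree: Optional["Node"]  # None plays the role of NULL_TREE_ID
    suffix_tree: Optional["Node"]


Node = Union[Subst, Match]


def apply_tree(tree: Node, form: str) -> Optional[str]:
    """Return the rewritten string, or None if the tree does not apply."""
    if isinstance(tree, Subst):
        return tree.subst if form == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    suffix_start = len(form) - tree.suffix_len
    pieces = []
    if tree.prefix_tree is not None:
        prefix = apply_tree(tree.prefix_tree, form[: tree.prefix_len])
        if prefix is None:
            return None
        pieces.append(prefix)
    # The middle part is copied unchanged.
    pieces.append(form[tree.prefix_len : suffix_start])
    if tree.suffix_tree is not None:
        suffix = apply_tree(tree.suffix_tree, form[suffix_start:])
        if suffix is None:
            return None
        pieces.append(suffix)
    return "".join(pieces)


# Tree for "gegooid" -> "gooien": drop the prefix "ge", rewrite the suffix "d" to "en".
tree = Match(2, 1, Subst("ge", ""), Subst("d", "en"))
print(apply_tree(tree, "gegooid"))  # gooien
print(apply_tree(tree, "gewerkt"))  # None
```

The second call returns `None` because the suffix `t` does not equal the `d` expected by the substitution node, which is the "node does not apply" case.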


@ -0,0 +1,305 @@
# cython: infer_types=True, binding=True
from cython.operator cimport dereference as deref
from libc.stdint cimport uint32_t
from libc.stdint cimport UINT32_MAX
from libc.string cimport memset
from libcpp.pair cimport pair
from libcpp.vector cimport vector
from pathlib import Path
from ...typedefs cimport hash_t
from ... import util
from ...errors import Errors
from ...strings import StringStore
from .schemas import validate_edit_tree
NULL_TREE_ID = UINT32_MAX
cdef LCS find_lcs(str source, str target):
"""
Find the longest common substring (LCS) of two strings, i.e. the longest
contiguous span that occurs in both. If there are multiple such spans, only
one of them is returned.
source (str): The first string.
target (str): The second string.
RETURNS (LCS): The begin and end offsets of the span in source and target.
"""
cdef Py_ssize_t source_len = len(source)
cdef Py_ssize_t target_len = len(target)
cdef size_t longest_align = 0
cdef int source_idx, target_idx
cdef LCS lcs
cdef Py_UCS4 source_cp, target_cp
memset(&lcs, 0, sizeof(lcs))
cdef vector[size_t] prev_aligns = vector[size_t](target_len)
cdef vector[size_t] cur_aligns = vector[size_t](target_len)
for (source_idx, source_cp) in enumerate(source):
for (target_idx, target_cp) in enumerate(target):
if source_cp == target_cp:
if source_idx == 0 or target_idx == 0:
cur_aligns[target_idx] = 1
else:
cur_aligns[target_idx] = prev_aligns[target_idx - 1] + 1
# Check if this is the longest alignment and replace previous
# best alignment when this is the case.
if cur_aligns[target_idx] > longest_align:
longest_align = cur_aligns[target_idx]
lcs.source_begin = source_idx - longest_align + 1
lcs.source_end = source_idx + 1
lcs.target_begin = target_idx - longest_align + 1
lcs.target_end = target_idx + 1
else:
# No match, we start with a zero-length alignment.
cur_aligns[target_idx] = 0
swap(prev_aligns, cur_aligns)
return lcs
cdef class EditTrees:
"""Container for constructing and storing edit trees."""
def __init__(self, strings: StringStore):
"""Create a container for edit trees.
strings (StringStore): the string store to use."""
self.strings = strings
cpdef uint32_t add(self, str form, str lemma):
"""Add an edit tree that rewrites the given string into the given lemma.
RETURNS (int): identifier of the edit tree in the container.
"""
# Treat two empty strings as a special case. Generating an edit
# tree for identical strings results in a match node. However,
# since two empty strings have a zero-length LCS, a substitution
# node would be created. Since we do not want to clutter the
# recursive tree construction with logic for this case, handle
# it in this wrapper method.
if len(form) == 0 and len(lemma) == 0:
tree = edittree_new_match(0, 0, NULL_TREE_ID, NULL_TREE_ID)
return self._tree_id(tree)
return self._add(form, lemma)
cdef uint32_t _add(self, str form, str lemma):
cdef LCS lcs = find_lcs(form, lemma)
cdef EditTreeC tree
cdef uint32_t tree_id, prefix_tree, suffix_tree
if lcs_is_empty(lcs):
tree = edittree_new_subst(self.strings.add(form), self.strings.add(lemma))
else:
# If we have a non-empty LCS, such as "gooi" in "ge[gooi]d" and "[gooi]en",
# create edit trees for the prefix pair ("ge"/"") and the suffix pair ("d"/"en").
prefix_tree = NULL_TREE_ID
if lcs.source_begin != 0 or lcs.target_begin != 0:
prefix_tree = self.add(form[:lcs.source_begin], lemma[:lcs.target_begin])
suffix_tree = NULL_TREE_ID
if lcs.source_end != len(form) or lcs.target_end != len(lemma):
suffix_tree = self.add(form[lcs.source_end:], lemma[lcs.target_end:])
tree = edittree_new_match(lcs.source_begin, len(form) - lcs.source_end, prefix_tree, suffix_tree)
return self._tree_id(tree)
cdef uint32_t _tree_id(self, EditTreeC tree):
# If this tree has been constructed before, return its identifier.
cdef hash_t hash = edittree_hash(tree)
cdef unordered_map[hash_t, uint32_t].iterator iter = self.map.find(hash)
if iter != self.map.end():
return deref(iter).second
# The tree hasn't been seen before, store it.
cdef uint32_t tree_id = self.trees.size()
self.trees.push_back(tree)
self.map.insert(pair[hash_t, uint32_t](hash, tree_id))
return tree_id
cpdef str apply(self, uint32_t tree_id, str form):
"""Apply an edit tree to a form.
tree_id (uint32_t): the identifier of the edit tree to apply.
form (str): the form to apply the edit tree to.
RETURNS (str): the transformed form or None if the edit tree
could not be applied to the form.
"""
if tree_id >= self.trees.size():
raise IndexError(Errors.E1030)
lemma_pieces = []
try:
self._apply(tree_id, form, lemma_pieces)
except ValueError:
return None
return "".join(lemma_pieces)
cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces):
"""Recursively apply an edit tree to a form, adding pieces to
the lemma_pieces list."""
assert tree_id < self.trees.size()
cdef EditTreeC tree = self.trees[tree_id]
cdef MatchNodeC match_node
cdef int suffix_start
if tree.is_match_node:
match_node = tree.inner.match_node
if match_node.prefix_len + match_node.suffix_len > len(form_part):
raise ValueError(Errors.E1029)
suffix_start = len(form_part) - match_node.suffix_len
if match_node.prefix_tree != NULL_TREE_ID:
self._apply(match_node.prefix_tree, form_part[:match_node.prefix_len], lemma_pieces)
lemma_pieces.append(form_part[match_node.prefix_len:suffix_start])
if match_node.suffix_tree != NULL_TREE_ID:
self._apply(match_node.suffix_tree, form_part[suffix_start:], lemma_pieces)
else:
if form_part == self.strings[tree.inner.subst_node.orig]:
lemma_pieces.append(self.strings[tree.inner.subst_node.subst])
else:
raise ValueError(Errors.E1029)
cpdef unicode tree_to_str(self, uint32_t tree_id):
"""Return the tree as a string. The tree string is formatted
like an S-expression. This is primarily useful for debugging. Match
nodes have the following format:
(m prefix_len suffix_len prefix_tree suffix_tree)
Substitution nodes have the following format:
(s original substitute)
tree_id (uint32_t): the identifier of the edit tree.
RETURNS (str): the tree as an S-expression.
"""
if tree_id >= self.trees.size():
raise IndexError(Errors.E1030)
cdef EditTreeC tree = self.trees[tree_id]
cdef SubstNodeC subst_node
if not tree.is_match_node:
subst_node = tree.inner.subst_node
return f"(s '{self.strings[subst_node.orig]}' '{self.strings[subst_node.subst]}')"
cdef MatchNodeC match_node = tree.inner.match_node
prefix_tree = "()"
if match_node.prefix_tree != NULL_TREE_ID:
prefix_tree = self.tree_to_str(match_node.prefix_tree)
suffix_tree = "()"
if match_node.suffix_tree != NULL_TREE_ID:
suffix_tree = self.tree_to_str(match_node.suffix_tree)
return f"(m {match_node.prefix_len} {match_node.suffix_len} {prefix_tree} {suffix_tree})"
def from_json(self, trees: list) -> "EditTrees":
self.trees.clear()
for tree in trees:
tree = _dict2tree(tree)
self.trees.push_back(tree)
self._rebuild_tree_map()
def from_bytes(self, bytes_data: bytes, *) -> "EditTrees":
def deserialize_trees(tree_dicts):
cdef EditTreeC c_tree
for tree_dict in tree_dicts:
c_tree = _dict2tree(tree_dict)
self.trees.push_back(c_tree)
deserializers = {}
deserializers["trees"] = lambda n: deserialize_trees(n)
util.from_bytes(bytes_data, deserializers, [])
self._rebuild_tree_map()
return self
def to_bytes(self, **kwargs) -> bytes:
tree_dicts = []
for tree in self.trees:
tree = _tree2dict(tree)
tree_dicts.append(tree)
serializers = {}
serializers["trees"] = lambda: tree_dicts
return util.to_bytes(serializers, [])
def to_disk(self, path, **kwargs) -> "EditTrees":
path = util.ensure_path(path)
with path.open("wb") as file_:
file_.write(self.to_bytes())
def from_disk(self, path, **kwargs) -> "EditTrees":
path = util.ensure_path(path)
if path.exists():
with path.open("rb") as file_:
data = file_.read()
return self.from_bytes(data)
return self
def __getitem__(self, idx):
return _tree2dict(self.trees[idx])
def __len__(self):
return self.trees.size()
def _rebuild_tree_map(self):
"""Rebuild the tree hash -> tree id mapping"""
cdef EditTreeC c_tree
cdef uint32_t tree_id
cdef hash_t tree_hash
self.map.clear()
for tree_id in range(self.trees.size()):
c_tree = self.trees[tree_id]
tree_hash = edittree_hash(c_tree)
self.map.insert(pair[hash_t, uint32_t](tree_hash, tree_id))
def __reduce__(self):
return (unpickle_edittrees, (self.strings, self.to_bytes()))
def unpickle_edittrees(strings, trees_data):
return EditTrees(strings).from_bytes(trees_data)
def _tree2dict(tree):
if tree["is_match_node"]:
tree = tree["inner"]["match_node"]
else:
tree = tree["inner"]["subst_node"]
return dict(tree)
def _dict2tree(tree):
errors = validate_edit_tree(tree)
if errors:
raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
tree = dict(tree)
if "prefix_len" in tree:
tree = {"is_match_node": True, "inner": {"match_node": tree}}
else:
tree = {"is_match_node": False, "inner": {"subst_node": tree}}
return tree
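
The construction performed by `find_lcs` and `EditTrees._add` can be sketched in pure Python. This is a simplified illustration, not the Cython code: `build_tree` is a hypothetical name, trees are returned as nested tuples instead of being interned by hash into a vector, and strings are kept inline.

```python
from typing import Tuple


def find_lcs(source: str, target: str) -> Tuple[int, int, int, int]:
    """Return (source_begin, source_end, target_begin, target_end) of one
    longest common contiguous span, or all zeros if nothing is shared."""
    best = (0, 0, 0, 0)
    longest = 0
    prev = [0] * len(target)
    for s_idx, s_ch in enumerate(source):
        cur = [0] * len(target)
        for t_idx, t_ch in enumerate(target):
            if s_ch == t_ch:
                # Extend the alignment that ended one position earlier in both strings.
                cur[t_idx] = 1 if s_idx == 0 or t_idx == 0 else prev[t_idx - 1] + 1
                if cur[t_idx] > longest:
                    longest = cur[t_idx]
                    best = (s_idx - longest + 1, s_idx + 1,
                            t_idx - longest + 1, t_idx + 1)
        prev = cur
    return best


def build_tree(form: str, lemma: str):
    """Build a nested-tuple edit tree that rewrites form into lemma."""
    if not form and not lemma:
        # Two empty strings: an empty match node, as in EditTrees.add.
        return ("m", 0, 0, None, None)
    s_begin, s_end, t_begin, t_end = find_lcs(form, lemma)
    if (s_begin, s_end, t_begin, t_end) == (0, 0, 0, 0):
        return ("s", form, lemma)
    prefix_tree = None
    if s_begin != 0 or t_begin != 0:
        prefix_tree = build_tree(form[:s_begin], lemma[:t_begin])
    suffix_tree = None
    if s_end != len(form) or t_end != len(lemma):
        suffix_tree = build_tree(form[s_end:], lemma[t_end:])
    return ("m", s_begin, len(form) - s_end, prefix_tree, suffix_tree)


print(find_lcs("gegooid", "gooien"))    # (2, 6, 0, 4)
print(build_tree("gegooid", "gooien"))  # ('m', 2, 1, ('s', 'ge', ''), ('s', 'd', 'en'))
print(build_tree("walked", "walk"))     # ('m', 0, 2, None, ('s', 'ed', ''))
```

Because a match node stores only prefix and suffix lengths, the same tree generalizes to other forms with the same affixes, which is what makes the trees usable as classification labels.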


@ -0,0 +1,44 @@
from typing import Any, Dict, List, Union
from collections import defaultdict
from pydantic import BaseModel, Field, ValidationError
from pydantic.types import StrictBool, StrictInt, StrictStr
class MatchNodeSchema(BaseModel):
prefix_len: StrictInt = Field(..., title="Prefix length")
suffix_len: StrictInt = Field(..., title="Suffix length")
prefix_tree: StrictInt = Field(..., title="Prefix tree")
suffix_tree: StrictInt = Field(..., title="Suffix tree")
class Config:
extra = "forbid"
class SubstNodeSchema(BaseModel):
orig: Union[int, StrictStr] = Field(..., title="Original substring")
subst: Union[int, StrictStr] = Field(..., title="Replacement substring")
class Config:
extra = "forbid"
class EditTreeSchema(BaseModel):
__root__: Union[MatchNodeSchema, SubstNodeSchema]
def validate_edit_tree(obj: Dict[str, Any]) -> List[str]:
"""Validate edit tree.
obj (Dict[str, Any]): JSON-serializable data to validate.
RETURNS (List[str]): A list of error messages, if available.
"""
try:
EditTreeSchema.parse_obj(obj)
return []
except ValidationError as e:
errors = e.errors()
data = defaultdict(list)
for error in errors:
err_loc = " -> ".join([str(p) for p in error.get("loc", [])])
data[err_loc].append(error.get("msg"))
return [f"[{loc}] {', '.join(msg)}" for loc, msg in data.items()] # type: ignore[arg-type]
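
A serialized tree is thus one of two dict shapes: a match node with four integer fields or a substitution node with `orig` and `subst`. The check can be approximated without pydantic as follows. This is a rough standard-library stand-in for illustration only; `check_edit_tree` and its messages are hypothetical and do not reproduce pydantic's exact coercion rules or error text.

```python
from typing import Any, Dict, List

MATCH_FIELDS = {"prefix_len", "suffix_len", "prefix_tree", "suffix_tree"}
SUBST_FIELDS = {"orig", "subst"}


def _is_strict_int(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def check_edit_tree(obj: Dict[str, Any]) -> List[str]:
    """Return a list of error messages; an empty list means the dict is valid."""
    keys = set(obj)
    if keys == MATCH_FIELDS:
        return [
            f"[{k}] expected an integer"
            for k in sorted(keys)
            if not _is_strict_int(obj[k])
        ]
    if keys == SUBST_FIELDS:
        return [
            f"[{k}] expected an integer or a string"
            for k in sorted(keys)
            if not (_is_strict_int(obj[k]) or isinstance(obj[k], str))
        ]
    # Extra or missing fields correspond to extra = "forbid" in the schemas.
    return [f"[__root__] unexpected set of fields: {sorted(keys)}"]


print(check_edit_tree({"orig": "d", "subst": "en"}))  # []
print(check_edit_tree({"orig": "d"}))
```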


@ -218,7 +218,7 @@ def _get_aligned_sent_starts(example):
sent_starts = [False] * len(example.x) sent_starts = [False] * len(example.x)
seen_words = set() seen_words = set()
for y_sent in example.y.sents: for y_sent in example.y.sents:
x_indices = list(align[y_sent.start : y_sent.end].dataXd) x_indices = list(align[y_sent.start : y_sent.end])
if any(x_idx in seen_words for x_idx in x_indices): if any(x_idx in seen_words for x_idx in x_indices):
# If there are any tokens in X that align across two sentences, # If there are any tokens in X that align across two sentences,
# regard the sentence annotations as missing, as we can't # regard the sentence annotations as missing, as we can't
@ -824,7 +824,7 @@ cdef class ArcEager(TransitionSystem):
for i in range(self.n_moves): for i in range(self.n_moves):
print(self.get_class_name(i), is_valid[i], costs[i]) print(self.get_class_name(i), is_valid[i], costs[i])
print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1))) print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1)))
raise ValueError("Could not find gold transition - see logs above.") raise ValueError(Errors.E1031)
def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None): def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None):
cdef int i cdef int i


@ -4,6 +4,10 @@ for doing pseudo-projective parsing implementation uses the HEAD decoration
scheme. scheme.
""" """
from copy import copy from copy import copy
from libc.limits cimport INT_MAX
from libc.stdlib cimport abs
from libcpp cimport bool
from libcpp.vector cimport vector
from ...tokens.doc cimport Doc, set_children_from_heads from ...tokens.doc cimport Doc, set_children_from_heads
@ -41,13 +45,18 @@ def contains_cycle(heads):
def is_nonproj_arc(tokenid, heads): def is_nonproj_arc(tokenid, heads):
cdef vector[int] c_heads = _heads_to_c(heads)
return _is_nonproj_arc(tokenid, c_heads)
cdef bool _is_nonproj_arc(int tokenid, const vector[int]& heads) nogil:
# definition (e.g. Havelka 2007): an arc h -> d, h < d is non-projective # definition (e.g. Havelka 2007): an arc h -> d, h < d is non-projective
# if there is a token k, h < k < d such that h is not # if there is a token k, h < k < d such that h is not
# an ancestor of k. Same for h -> d, h > d # an ancestor of k. Same for h -> d, h > d
head = heads[tokenid] head = heads[tokenid]
if head == tokenid: # root arcs cannot be non-projective if head == tokenid: # root arcs cannot be non-projective
return False return False
elif head is None: # unattached tokens cannot be non-projective elif head < 0: # unattached tokens cannot be non-projective
return False return False
cdef int start, end cdef int start, end
@ -56,19 +65,29 @@ def is_nonproj_arc(tokenid, heads):
else: else:
start, end = (tokenid+1, head) start, end = (tokenid+1, head)
for k in range(start, end): for k in range(start, end):
for ancestor in ancestors(k, heads): if _has_head_as_ancestor(k, head, heads):
if ancestor is None: # for unattached tokens/subtrees continue
break
elif ancestor == head: # normal case: k dominated by h
break
else: # head not in ancestors: d -> h is non-projective else: # head not in ancestors: d -> h is non-projective
return True return True
return False return False
cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads) nogil:
ancestor = tokenid
cnt = 0
while cnt < heads.size():
if heads[ancestor] == head or heads[ancestor] < 0:
return True
ancestor = heads[ancestor]
cnt += 1
return False
def is_nonproj_tree(heads): def is_nonproj_tree(heads):
cdef vector[int] c_heads = _heads_to_c(heads)
# a tree is non-projective if at least one arc is non-projective # a tree is non-projective if at least one arc is non-projective
return any(is_nonproj_arc(word, heads) for word in range(len(heads))) return any(_is_nonproj_arc(word, c_heads) for word in range(len(heads)))
def decompose(label): def decompose(label):
@ -98,16 +117,31 @@ def projectivize(heads, labels):
# tree, i.e. connected and cycle-free. Returns a new pair (heads, labels) # tree, i.e. connected and cycle-free. Returns a new pair (heads, labels)
# which encode a projective and decorated tree. # which encode a projective and decorated tree.
proj_heads = copy(heads) proj_heads = copy(heads)
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
if smallest_np_arc is None: # this sentence is already projective cdef int new_head
cdef vector[int] c_proj_heads = _heads_to_c(proj_heads)
cdef int smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
if smallest_np_arc == -1: # this sentence is already projective
return proj_heads, copy(labels) return proj_heads, copy(labels)
while smallest_np_arc is not None: while smallest_np_arc != -1:
_lift(smallest_np_arc, proj_heads) new_head = _lift(smallest_np_arc, proj_heads)
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads) c_proj_heads[smallest_np_arc] = new_head
smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
deco_labels = _decorate(heads, proj_heads, labels) deco_labels = _decorate(heads, proj_heads, labels)
return proj_heads, deco_labels return proj_heads, deco_labels
cdef vector[int] _heads_to_c(heads):
cdef vector[int] c_heads
for head in heads:
if head is None:
c_heads.push_back(-1)
else:
assert head < len(heads)
c_heads.push_back(head)
return c_heads
cpdef deprojectivize(Doc doc): cpdef deprojectivize(Doc doc):
# Reattach arcs with decorated labels (following HEAD scheme). For each # Reattach arcs with decorated labels (following HEAD scheme). For each
# decorated arc X||Y, search top-down, left-to-right, breadth-first until # decorated arc X||Y, search top-down, left-to-right, breadth-first until
@ -137,27 +171,38 @@ def _decorate(heads, proj_heads, labels):
deco_labels.append(labels[tokenid]) deco_labels.append(labels[tokenid])
return deco_labels return deco_labels
def get_smallest_nonproj_arc_slow(heads):
cdef vector[int] c_heads = _heads_to_c(heads)
return _get_smallest_nonproj_arc(c_heads)
def _get_smallest_nonproj_arc(heads):
cdef int _get_smallest_nonproj_arc(const vector[int]& heads) nogil:
# return the smallest non-proj arc or None # return the smallest non-proj arc or None
# where size is defined as the distance between dep and head # where size is defined as the distance between dep and head
# and ties are broken left to right # and ties are broken left to right
smallest_size = float('inf') cdef int smallest_size = INT_MAX
smallest_np_arc = None cdef int smallest_np_arc = -1
for tokenid, head in enumerate(heads): cdef int size
cdef int tokenid
cdef int head
for tokenid in range(heads.size()):
head = heads[tokenid]
size = abs(tokenid-head) size = abs(tokenid-head)
if size < smallest_size and is_nonproj_arc(tokenid, heads): if size < smallest_size and _is_nonproj_arc(tokenid, heads):
smallest_size = size smallest_size = size
smallest_np_arc = tokenid smallest_np_arc = tokenid
return smallest_np_arc return smallest_np_arc
def _lift(tokenid, heads): cpdef int _lift(tokenid, heads):
# reattaches a word to its grandfather # reattaches a word to its grandfather
head = heads[tokenid] head = heads[tokenid]
ghead = heads[head] ghead = heads[head]
cdef int new_head = ghead if head != ghead else tokenid
# attach to ghead if head isn't attached to root else attach to root # attach to ghead if head isn't attached to root else attach to root
heads[tokenid] = ghead if head != ghead else tokenid heads[tokenid] = new_head
return new_head
def _find_new_head(token, headlabel): def _find_new_head(token, headlabel):
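
The rewritten projectivity check and lifting loop can be sketched in pure Python. This is an illustration of the logic on the right-hand side of the diff, not the Cython code: function names are shortened, heads are a plain list of ints with `-1` for unattached tokens, and the `head < tokenid` branch is an assumption because that part of the function falls outside the hunks shown.

```python
import sys
from typing import List


def has_head_as_ancestor(tokenid: int, head: int, heads: List[int]) -> bool:
    ancestor = tokenid
    # Bound the walk by the sentence length so a cycle cannot loop forever.
    for _ in range(len(heads)):
        if heads[ancestor] == head or heads[ancestor] < 0:
            return True
        ancestor = heads[ancestor]
    return False


def is_nonproj_arc(tokenid: int, heads: List[int]) -> bool:
    head = heads[tokenid]
    if head == tokenid or head < 0:  # root arcs and unattached tokens
        return False
    if head < tokenid:
        start, end = head + 1, tokenid
    else:
        start, end = tokenid + 1, head
    # Every token strictly between head and dependent must be dominated by head.
    return any(not has_head_as_ancestor(k, head, heads) for k in range(start, end))


def smallest_nonproj_arc(heads: List[int]) -> int:
    smallest_size, smallest = sys.maxsize, -1
    for tokenid, head in enumerate(heads):
        size = abs(tokenid - head)
        if size < smallest_size and is_nonproj_arc(tokenid, heads):
            smallest_size, smallest = size, tokenid
    return smallest


def projectivize_heads(heads: List[int]) -> List[int]:
    proj = list(heads)
    arc = smallest_nonproj_arc(proj)
    while arc != -1:
        head = proj[arc]
        ghead = proj[head]
        # Lift: attach to the grandparent, or make the token a root.
        proj[arc] = ghead if head != ghead else arc
        arc = smallest_nonproj_arc(proj)
    return proj


heads = [2, 3, 2, 2]  # token 2 is the root; the arc 3 -> 1 crosses it
print(smallest_nonproj_arc(heads))  # 1
print(projectivize_heads(heads))    # [2, 2, 2, 2]
```

Using `-1` instead of `None` for "no arc" is what allows the Cython version to keep heads in a `vector[int]` and run the inner loops without the GIL.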


@ -0,0 +1,379 @@
from typing import cast, Any, Callable, Dict, Iterable, List, Optional
from typing import Sequence, Tuple, Union
from collections import Counter
from copy import deepcopy
from itertools import islice
import numpy as np
import srsly
from thinc.api import Config, Model, SequenceCategoricalCrossentropy
from thinc.types import Floats2d, Ints1d, Ints2d
from ._edit_tree_internals.edit_trees import EditTrees
from ._edit_tree_internals.schemas import validate_edit_tree
from .lemmatizer import lemmatizer_score
from .trainable_pipe import TrainablePipe
from ..errors import Errors
from ..language import Language
from ..tokens import Doc
from ..training import Example, validate_examples, validate_get_examples
from ..vocab import Vocab
from .. import util
default_model_config = """
[model]
@architectures = "spacy.Tagger.v2"
[model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
"""
DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
@Language.factory(
"trainable_lemmatizer",
assigns=["token.lemma"],
requires=[],
default_config={
"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
"backoff": "orth",
"min_tree_freq": 3,
"overwrite": False,
"top_k": 1,
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
},
default_score_weights={"lemma_acc": 1.0},
)
def make_edit_tree_lemmatizer(
nlp: Language,
name: str,
model: Model,
backoff: Optional[str],
min_tree_freq: int,
overwrite: bool,
top_k: int,
scorer: Optional[Callable],
):
"""Construct an EditTreeLemmatizer component."""
return EditTreeLemmatizer(
nlp.vocab,
model,
name,
backoff=backoff,
min_tree_freq=min_tree_freq,
overwrite=overwrite,
top_k=top_k,
scorer=scorer,
)
class EditTreeLemmatizer(TrainablePipe):
"""
Lemmatizer that lemmatizes each word using a predicted edit tree.
"""
def __init__(
self,
vocab: Vocab,
model: Model,
name: str = "trainable_lemmatizer",
*,
backoff: Optional[str] = "orth",
min_tree_freq: int = 3,
overwrite: bool = False,
top_k: int = 1,
scorer: Optional[Callable] = lemmatizer_score,
):
"""
Construct an edit tree lemmatizer.
backoff (Optional[str]): backoff to use when the predicted edit trees
are not applicable. Must be an attribute of Token or None (leave the
lemma unset).
min_tree_freq (int): prune trees that occur fewer than this many times
in the training data.
overwrite (bool): overwrite existing lemma annotations.
top_k (int): try to apply at most the k most probable edit trees.
"""
self.vocab = vocab
self.model = model
self.name = name
self.backoff = backoff
self.min_tree_freq = min_tree_freq
self.overwrite = overwrite
self.top_k = top_k
self.trees = EditTrees(self.vocab.strings)
self.tree2label: Dict[int, int] = {}
self.cfg: Dict[str, Any] = {"labels": []}
self.scorer = scorer
def get_loss(
self, examples: Iterable[Example], scores: List[Floats2d]
) -> Tuple[float, List[Floats2d]]:
validate_examples(examples, "EditTreeLemmatizer.get_loss")
loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
truths = []
for eg in examples:
eg_truths = []
for (predicted, gold_lemma) in zip(
eg.predicted, eg.get_aligned("LEMMA", as_string=True)
):
if gold_lemma is None:
label = -1
else:
tree_id = self.trees.add(predicted.text, gold_lemma)
label = self.tree2label.get(tree_id, 0)
eg_truths.append(label)
truths.append(eg_truths)
d_scores, loss = loss_func(scores, truths) # type: ignore
if self.model.ops.xp.isnan(loss):
raise ValueError(Errors.E910.format(name=self.name))
return float(loss), d_scores
def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
n_docs = len(list(docs))
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
n_labels = len(self.cfg["labels"])
guesses: List[Ints2d] = [
self.model.ops.alloc((0, n_labels), dtype="i") for doc in docs
]
assert len(guesses) == n_docs
return guesses
scores = self.model.predict(docs)
assert len(scores) == n_docs
guesses = self._scores2guesses(docs, scores)
assert len(guesses) == n_docs
return guesses
def _scores2guesses(self, docs, scores):
guesses = []
for doc, doc_scores in zip(docs, scores):
if self.top_k == 1:
doc_guesses = doc_scores.argmax(axis=1).reshape(-1, 1)
else:
doc_guesses = np.argsort(doc_scores)[..., : -self.top_k - 1 : -1]
if not isinstance(doc_guesses, np.ndarray):
doc_guesses = doc_guesses.get()
doc_compat_guesses = []
for token, candidates in zip(doc, doc_guesses):
tree_id = -1
for candidate in candidates:
candidate_tree_id = self.cfg["labels"][candidate]
if self.trees.apply(candidate_tree_id, token.text) is not None:
tree_id = candidate_tree_id
break
doc_compat_guesses.append(tree_id)
guesses.append(np.array(doc_compat_guesses))
return guesses
def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
for i, doc in enumerate(docs):
doc_tree_ids = batch_tree_ids[i]
if hasattr(doc_tree_ids, "get"):
doc_tree_ids = doc_tree_ids.get()
for j, tree_id in enumerate(doc_tree_ids):
if self.overwrite or doc[j].lemma == 0:
# If no applicable tree could be found during prediction,
# the special identifier -1 is used. Otherwise the tree
# is guaranteed to be applicable.
if tree_id == -1:
if self.backoff is not None:
doc[j].lemma = getattr(doc[j], self.backoff)
else:
lemma = self.trees.apply(tree_id, doc[j].text)
doc[j].lemma_ = lemma
@property
def labels(self) -> Tuple[int, ...]:
"""Returns the labels currently added to the component."""
return tuple(self.cfg["labels"])
@property
def hide_labels(self) -> bool:
return True
@property
def label_data(self) -> Dict:
trees = []
for tree_id in range(len(self.trees)):
tree = self.trees[tree_id]
if "orig" in tree:
tree["orig"] = self.vocab.strings[tree["orig"]]
if "subst" in tree:
tree["subst"] = self.vocab.strings[tree["subst"]]
trees.append(tree)
return dict(trees=trees, labels=tuple(self.cfg["labels"]))
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Optional[Language] = None,
labels: Optional[Dict] = None,
):
validate_get_examples(get_examples, "EditTreeLemmatizer.initialize")
if labels is None:
self._labels_from_data(get_examples)
else:
self._add_labels(labels)
# Sample for the model.
doc_sample = []
label_sample = []
for example in islice(get_examples(), 10):
doc_sample.append(example.x)
gold_labels: List[List[float]] = []
for token in example.reference:
if token.lemma == 0:
gold_label = None
else:
gold_label = self._pair2label(token.text, token.lemma_)
gold_labels.append(
[
1.0 if label == gold_label else 0.0
for label in self.cfg["labels"]
]
)
gold_labels = cast(Floats2d, gold_labels)
label_sample.append(self.model.ops.asarray(gold_labels, dtype="float32"))
self._require_labels()
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(label_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(X=doc_sample, Y=label_sample)
def from_bytes(self, bytes_data, *, exclude=tuple()):
deserializers = {
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
"model": lambda b: self.model.from_bytes(b),
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
"trees": lambda b: self.trees.from_bytes(b),
}
util.from_bytes(bytes_data, deserializers, exclude)
return self
def to_bytes(self, *, exclude=tuple()):
serializers = {
"cfg": lambda: srsly.json_dumps(self.cfg),
"model": lambda: self.model.to_bytes(),
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
"trees": lambda: self.trees.to_bytes(),
}
return util.to_bytes(serializers, exclude)
def to_disk(self, path, exclude=tuple()):
path = util.ensure_path(path)
serializers = {
"cfg": lambda p: srsly.write_json(p, self.cfg),
"model": lambda p: self.model.to_disk(p),
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
"trees": lambda p: self.trees.to_disk(p),
}
util.to_disk(path, serializers, exclude)
def from_disk(self, path, exclude=tuple()):
def load_model(p):
try:
with open(p, "rb") as mfile:
self.model.from_bytes(mfile.read())
except AttributeError:
raise ValueError(Errors.E149) from None
deserializers = {
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
"model": load_model,
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
"trees": lambda p: self.trees.from_disk(p),
}
util.from_disk(path, deserializers, exclude)
return self
def _add_labels(self, labels: Dict):
if "labels" not in labels:
raise ValueError(Errors.E857.format(name="labels"))
if "trees" not in labels:
raise ValueError(Errors.E857.format(name="trees"))
self.cfg["labels"] = list(labels["labels"])
trees = []
for tree in labels["trees"]:
errors = validate_edit_tree(tree)
if errors:
raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
tree = dict(tree)
if "orig" in tree:
tree["orig"] = self.vocab.strings[tree["orig"]]
if "subst" in tree:
tree["subst"] = self.vocab.strings[tree["subst"]]
trees.append(tree)
self.trees.from_json(trees)
for label, tree in enumerate(self.labels):
self.tree2label[tree] = label
def _labels_from_data(self, get_examples: Callable[[], Iterable[Example]]):
# Count corpus tree frequencies in ad-hoc storage to avoid cluttering
# the final pipe/string store.
vocab = Vocab()
trees = EditTrees(vocab.strings)
tree_freqs: Counter = Counter()
repr_pairs: Dict = {}
for example in get_examples():
for token in example.reference:
if token.lemma != 0:
tree_id = trees.add(token.text, token.lemma_)
tree_freqs[tree_id] += 1
repr_pairs[tree_id] = (token.text, token.lemma_)
# Construct trees that make the frequency cut-off using representative
# form/lemma pairs.
for tree_id, freq in tree_freqs.items():
if freq >= self.min_tree_freq:
form, lemma = repr_pairs[tree_id]
self._pair2label(form, lemma, add_label=True)
def _pair2label(self, form, lemma, add_label=False):
"""
Look up the edit tree identifier for a form/lemma pair. If the edit
tree is unknown and "add_label" is set, the edit tree will be added to
the labels.
"""
tree_id = self.trees.add(form, lemma)
if tree_id not in self.tree2label:
if not add_label:
return None
self.tree2label[tree_id] = len(self.cfg["labels"])
self.cfg["labels"].append(tree_id)
return self.tree2label[tree_id]
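
The frequency cut-off and label bookkeeping above can be illustrated with a toy version. This is only a sketch: `suffix_rule` is a hypothetical, much simpler stand-in for a real edit tree, and `labels_from_pairs` is an invented helper name.

```python
from collections import Counter
from os.path import commonprefix


def suffix_rule(form, lemma):
    """Hypothetical stand-in for an edit tree: (suffix to strip, suffix to add)."""
    prefix = commonprefix([form, lemma])
    return form[len(prefix):], lemma[len(prefix):]


def labels_from_pairs(pairs, min_freq):
    """Keep only rules seen at least `min_freq` times and number them."""
    freqs = Counter(suffix_rule(form, lemma) for form, lemma in pairs)
    labels = []
    rule2label = {}
    for rule, freq in freqs.items():
        if freq >= min_freq:
            # A new label gets the next free index, as in _pair2label.
            rule2label[rule] = len(labels)
            labels.append(rule)
    return labels, rule2label


pairs = [("walked", "walk"), ("talked", "talk"), ("cats", "cat"), ("mice", "mouse")]
labels, rule2label = labels_from_pairs(pairs, min_freq=2)
```

Only the rule shared by "walked" and "talked" survives a cut-off of 2. The rarer rules are left out of the label set.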


@ -6,17 +6,17 @@ import srsly
import random import random
from thinc.api import CosineDistance, Model, Optimizer, Config from thinc.api import CosineDistance, Model, Optimizer, Config
from thinc.api import set_dropout_rate from thinc.api import set_dropout_rate
import warnings
from ..kb import KnowledgeBase, Candidate from ..kb import KnowledgeBase, Candidate
from ..ml import empty_kb from ..ml import empty_kb
from ..tokens import Doc, Span from ..tokens import Doc, Span
from .pipe import deserialize_config from .pipe import deserialize_config
from .legacy.entity_linker import EntityLinker_v1
from .trainable_pipe import TrainablePipe from .trainable_pipe import TrainablePipe
from ..language import Language from ..language import Language
from ..vocab import Vocab from ..vocab import Vocab
from ..training import Example, validate_examples, validate_get_examples from ..training import Example, validate_examples, validate_get_examples
from ..errors import Errors, Warnings from ..errors import Errors
from ..util import SimpleFrozenList, registry from ..util import SimpleFrozenList, registry
from .. import util from .. import util
from ..scorer import Scorer from ..scorer import Scorer
@ -26,7 +26,7 @@ BACKWARD_OVERWRITE = True
default_model_config = """ default_model_config = """
[model] [model]
@architectures = "spacy.EntityLinker.v1" @architectures = "spacy.EntityLinker.v2"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2" @architectures = "spacy.HashEmbedCNN.v2"
@ -55,6 +55,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"}, "get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
"overwrite": True, "overwrite": True,
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"}, "scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
"use_gold_ents": True,
}, },
default_score_weights={ default_score_weights={
"nel_micro_f": 1.0, "nel_micro_f": 1.0,
@ -75,6 +76,7 @@ def make_entity_linker(
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]], get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool, overwrite: bool,
scorer: Optional[Callable], scorer: Optional[Callable],
use_gold_ents: bool,
): ):
"""Construct an EntityLinker component. """Construct an EntityLinker component.
@ -90,6 +92,22 @@ def make_entity_linker(
produces a list of candidates, given a certain knowledge base and a textual mention. produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. scorer (Optional[Callable]): The scoring method.
""" """
if not model.attrs.get("include_span_maker", False):
# The only difference in arguments here is that use_gold_ents is not available
return EntityLinker_v1(
nlp.vocab,
model,
name,
labels_discard=labels_discard,
n_sents=n_sents,
incl_prior=incl_prior,
incl_context=incl_context,
entity_vector_length=entity_vector_length,
get_candidates=get_candidates,
overwrite=overwrite,
scorer=scorer,
)
return EntityLinker( return EntityLinker(
nlp.vocab, nlp.vocab,
model, model,
@ -102,6 +120,7 @@ def make_entity_linker(
get_candidates=get_candidates, get_candidates=get_candidates,
overwrite=overwrite, overwrite=overwrite,
scorer=scorer, scorer=scorer,
use_gold_ents=use_gold_ents,
) )
@ -136,6 +155,7 @@ class EntityLinker(TrainablePipe):
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]], get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool = BACKWARD_OVERWRITE, overwrite: bool = BACKWARD_OVERWRITE,
scorer: Optional[Callable] = entity_linker_score, scorer: Optional[Callable] = entity_linker_score,
use_gold_ents: bool,
) -> None: ) -> None:
"""Initialize an entity linker. """Initialize an entity linker.
@ -152,6 +172,8 @@ class EntityLinker(TrainablePipe):
produces a list of candidates, given a certain knowledge base and a textual mention. produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. Defaults to scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_links. Scorer.score_links.
use_gold_ents (bool): Whether to copy entities from gold docs or not. If false, another
component must provide entity annotations.
DOCS: https://spacy.io/api/entitylinker#init DOCS: https://spacy.io/api/entitylinker#init
""" """
@ -169,6 +191,7 @@ class EntityLinker(TrainablePipe):
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'. # create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab) self.kb = empty_kb(entity_vector_length)(self.vocab)
self.scorer = scorer self.scorer = scorer
self.use_gold_ents = use_gold_ents
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]): def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will """Define the KB of this pipe by providing a function that will
@ -212,14 +235,48 @@ class EntityLinker(TrainablePipe):
doc_sample = [] doc_sample = []
vector_sample = [] vector_sample = []
for example in islice(get_examples(), 10): for example in islice(get_examples(), 10):
doc_sample.append(example.x) doc = example.x
if self.use_gold_ents:
doc.ents = example.y.ents
doc_sample.append(doc)
vector_sample.append(self.model.ops.alloc1f(nO)) vector_sample.append(self.model.ops.alloc1f(nO))
assert len(doc_sample) > 0, Errors.E923.format(name=self.name) assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(vector_sample) > 0, Errors.E923.format(name=self.name) assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
# XXX In order for size estimation to work, there has to be at least
# one entity. It's not used for training so it doesn't have to be real,
# so we add a fake one if none are present.
# We can't use Doc.has_annotation here because it can be True for docs
# that have been through an NER component but got no entities.
has_annotations = any([doc.ents for doc in doc_sample])
if not has_annotations:
doc = doc_sample[0]
ent = doc[0:1]
ent.label_ = "XXX"
doc.ents = (ent,)
self.model.initialize( self.model.initialize(
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32") X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
) )
if not has_annotations:
# Clean up dummy annotation
doc.ents = []
def batch_has_learnable_example(self, examples):
"""Check if a batch contains a learnable example.
If one isn't present, then the update step needs to be skipped.
"""
for eg in examples:
for ent in eg.predicted.ents:
candidates = list(self.get_candidates(self.kb, ent))
if candidates:
return True
return False
def update( def update(
self, self,
examples: Iterable[Example], examples: Iterable[Example],
@ -247,35 +304,29 @@ class EntityLinker(TrainablePipe):
if not examples: if not examples:
return losses return losses
validate_examples(examples, "EntityLinker.update") validate_examples(examples, "EntityLinker.update")
sentence_docs = []
for eg in examples:
sentences = [s for s in eg.reference.sents]
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents:
# KB ID of the first token is the same as the whole span
kb_id = kb_ids[ent.start]
if kb_id:
try:
# find the sentence in the list of sentences.
sent_index = sentences.index(ent.sent)
except AttributeError:
# Catch the exception when ent.sent is None and provide a user-friendly warning
raise RuntimeError(Errors.E030) from None
# get n previous sentences, if there are any
start_sentence = max(0, sent_index - self.n_sents)
# get n posterior sentences, or as many < n as there are
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
# get token positions
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
# append that span as a doc to training
sent_doc = eg.predicted[start_token:end_token].as_doc()
sentence_docs.append(sent_doc)
set_dropout_rate(self.model, drop) set_dropout_rate(self.model, drop)
if not sentence_docs: docs = [eg.predicted for eg in examples]
warnings.warn(Warnings.W093.format(name="Entity Linker")) # save to restore later
old_ents = [doc.ents for doc in docs]
for doc, ex in zip(docs, examples):
if self.use_gold_ents:
doc.ents = ex.reference.ents
else:
# only keep matching ents
doc.ents = ex.get_matching_ents()
# make sure we have something to learn from, if not, short-circuit
if not self.batch_has_learnable_example(examples):
return losses return losses
sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
sentence_encodings, bp_context = self.model.begin_update(docs)
# now restore the ents
for doc, old in zip(docs, old_ents):
doc.ents = old
loss, d_scores = self.get_loss( loss, d_scores = self.get_loss(
sentence_encodings=sentence_encodings, examples=examples sentence_encodings=sentence_encodings, examples=examples
) )
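
The save, replace and restore handling of `doc.ents` in the new `update` can be expressed as a context manager. This is a sketch of the pattern only, not what the PR does: `FakeDoc` and `temporary_ents` are hypothetical names, and unlike the code above this variant also restores the old value on early exits.

```python
from contextlib import contextmanager


class FakeDoc:
    """Hypothetical stand-in for a Doc; only the `ents` attribute matters."""

    def __init__(self, ents):
        self.ents = ents


@contextmanager
def temporary_ents(docs, replacements):
    # Save the current entities so they can be restored later.
    old_ents = [doc.ents for doc in docs]
    try:
        for doc, ents in zip(docs, replacements):
            doc.ents = ents
        yield docs
    finally:
        # Restore on every exit path, including exceptions.
        for doc, ents in zip(docs, old_ents):
            doc.ents = ents


docs = [FakeDoc(("predicted",))]
with temporary_ents(docs, [("gold",)]):
    inside = docs[0].ents
after = docs[0].ents
```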
@ -288,24 +339,38 @@ class EntityLinker(TrainablePipe):
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d): def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
validate_examples(examples, "EntityLinker.get_loss") validate_examples(examples, "EntityLinker.get_loss")
entity_encodings = [] entity_encodings = []
eidx = 0 # indices in gold entities to keep
keep_ents = [] # indices in sentence_encodings to keep
for eg in examples: for eg in examples:
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True) kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents: for ent in eg.reference.ents:
kb_id = kb_ids[ent.start] kb_id = kb_ids[ent.start]
if kb_id: if kb_id:
entity_encoding = self.kb.get_vector(kb_id) entity_encoding = self.kb.get_vector(kb_id)
entity_encodings.append(entity_encoding) entity_encodings.append(entity_encoding)
keep_ents.append(eidx)
eidx += 1
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32") entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
if sentence_encodings.shape != entity_encodings.shape: selected_encodings = sentence_encodings[keep_ents]
# If the entity encodings list is empty, then
if selected_encodings.shape != entity_encodings.shape:
err = Errors.E147.format( err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up" method="get_loss", msg="gold entities do not match up"
) )
raise RuntimeError(err) raise RuntimeError(err)
# TODO: fix typing issue here # TODO: fix typing issue here
gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore gradients = self.distance.get_grad(selected_encodings, entity_encodings) # type: ignore
loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore # to match the input size, we need to give a zero gradient for items not in the kb
out = self.model.ops.alloc2f(*sentence_encodings.shape)
out[keep_ents] = gradients
loss = self.distance.get_loss(selected_encodings, entity_encodings) # type: ignore
loss = loss / len(entity_encodings) loss = loss / len(entity_encodings)
return float(loss), gradients return float(loss), out
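
The zero-gradient padding in the new `get_loss` can be pictured with plain lists. A minimal sketch, assuming nested lists in place of the arrays returned by `alloc2f`; `scatter_gradients` is a hypothetical helper name.

```python
def scatter_gradients(n_rows, width, keep_rows, gradients):
    """Place each gradient at its kept row and leave all other rows at zero."""
    # Start from an all-zero matrix with one row per encoded entity.
    out = [[0.0] * width for _ in range(n_rows)]
    for row, grad in zip(keep_rows, gradients):
        out[row] = list(grad)
    return out


# Three entities were encoded, but only rows 0 and 2 have a KB ID.
padded = scatter_gradients(3, 2, [0, 2], [[1.0, 2.0], [3.0, 4.0]])
```

Row 1 keeps a zero gradient, so the output has the same shape as the model output while the entity without a KB ID contributes nothing to the update.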
def predict(self, docs: Iterable[Doc]) -> List[str]: def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them. """Apply the pipeline's model to a batch of docs, without modifying them.


@ -0,0 +1,3 @@
from .entity_linker import EntityLinker_v1
__all__ = ["EntityLinker_v1"]


@ -0,0 +1,427 @@
# This file is present to provide a prior version of the EntityLinker component
# for backwards compatibility. For details see #9669.
from typing import Optional, Iterable, Callable, Dict, Union, List, Any
from thinc.types import Floats2d
from pathlib import Path
from itertools import islice
import srsly
import random
from thinc.api import CosineDistance, Model, Optimizer, Config
from thinc.api import set_dropout_rate
import warnings
from ...kb import KnowledgeBase, Candidate
from ...ml import empty_kb
from ...tokens import Doc, Span
from ..pipe import deserialize_config
from ..trainable_pipe import TrainablePipe
from ...language import Language
from ...vocab import Vocab
from ...training import Example, validate_examples, validate_get_examples
from ...errors import Errors, Warnings
from ...util import SimpleFrozenList, registry
from ... import util
from ...scorer import Scorer
# See #9050
BACKWARD_OVERWRITE = True
def entity_linker_score(examples, **kwargs):
return Scorer.score_links(examples, negative_labels=[EntityLinker_v1.NIL], **kwargs)
class EntityLinker_v1(TrainablePipe):
"""Pipeline component for named entity linking.
DOCS: https://spacy.io/api/entitylinker
"""
NIL = "NIL" # string used to refer to a non-existing link
def __init__(
self,
vocab: Vocab,
model: Model,
name: str = "entity_linker",
*,
labels_discard: Iterable[str],
n_sents: int,
incl_prior: bool,
incl_context: bool,
entity_vector_length: int,
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
overwrite: bool = BACKWARD_OVERWRITE,
scorer: Optional[Callable] = entity_linker_score,
) -> None:
"""Initialize an entity linker.
vocab (Vocab): The shared vocabulary.
model (thinc.api.Model): The Thinc Model powering the pipeline component.
name (str): The component instance name, used to add entries to the
losses during training.
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
n_sents (int): The number of neighbouring sentences to take into account.
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
incl_context (bool): Whether or not to include the local context in the model.
entity_vector_length (int): Size of encoding vectors in the KB.
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
produces a list of candidates, given a certain knowledge base and a textual mention.
scorer (Optional[Callable]): The scoring method. Defaults to
Scorer.score_links.
DOCS: https://spacy.io/api/entitylinker#init
"""
self.vocab = vocab
self.model = model
self.name = name
self.labels_discard = list(labels_discard)
self.n_sents = n_sents
self.incl_prior = incl_prior
self.incl_context = incl_context
self.get_candidates = get_candidates
self.cfg: Dict[str, Any] = {"overwrite": overwrite}
self.distance = CosineDistance(normalize=False)
# how many neighbour sentences to take into account
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
self.kb = empty_kb(entity_vector_length)(self.vocab)
self.scorer = scorer
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
"""Define the KB of this pipe by providing a function that will
create it using this object's vocab."""
if not callable(kb_loader):
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
self.kb = kb_loader(self.vocab)
def validate_kb(self) -> None:
# Raise an error if the knowledge base is not initialized.
if self.kb is None:
raise ValueError(Errors.E1018.format(name=self.name))
if len(self.kb) == 0:
raise ValueError(Errors.E139.format(name=self.name))
def initialize(
self,
get_examples: Callable[[], Iterable[Example]],
*,
nlp: Optional[Language] = None,
kb_loader: Optional[Callable[[Vocab], KnowledgeBase]] = None,
):
"""Initialize the pipe for training, using a representative set
of data examples.
get_examples (Callable[[], Iterable[Example]]): Function that
returns a representative sample of gold-standard Example objects.
nlp (Language): The current nlp object the component is part of.
kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates a KnowledgeBase from a Vocab instance.
Note that providing this argument will overwrite all data accumulated in the current KB.
Use this only when loading a KB as-such from file.
DOCS: https://spacy.io/api/entitylinker#initialize
"""
validate_get_examples(get_examples, "EntityLinker_v1.initialize")
if kb_loader is not None:
self.set_kb(kb_loader)
self.validate_kb()
nO = self.kb.entity_vector_length
doc_sample = []
vector_sample = []
for example in islice(get_examples(), 10):
doc_sample.append(example.x)
vector_sample.append(self.model.ops.alloc1f(nO))
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
self.model.initialize(
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
)
def update(
self,
examples: Iterable[Example],
*,
drop: float = 0.0,
sgd: Optional[Optimizer] = None,
losses: Optional[Dict[str, float]] = None,
) -> Dict[str, float]:
"""Learn from a batch of documents and gold-standard information,
updating the pipe's model. Delegates to predict and get_loss.
examples (Iterable[Example]): A batch of Example objects.
drop (float): The dropout rate.
sgd (thinc.api.Optimizer): The optimizer.
losses (Dict[str, float]): Optional record of the loss during training.
Updated using the component name as the key.
RETURNS (Dict[str, float]): The updated losses dictionary.
DOCS: https://spacy.io/api/entitylinker#update
"""
self.validate_kb()
if losses is None:
losses = {}
losses.setdefault(self.name, 0.0)
if not examples:
return losses
validate_examples(examples, "EntityLinker_v1.update")
sentence_docs = []
for eg in examples:
sentences = [s for s in eg.reference.sents]
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents:
# KB ID of the first token is the same as the whole span
kb_id = kb_ids[ent.start]
if kb_id:
try:
# find the sentence in the list of sentences.
sent_index = sentences.index(ent.sent)
except AttributeError:
# Catch the exception when ent.sent is None and provide a user-friendly warning
raise RuntimeError(Errors.E030) from None
# get n previous sentences, if there are any
start_sentence = max(0, sent_index - self.n_sents)
# get n posterior sentences, or as many < n as there are
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
# get token positions
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
# append that span as a doc to training
sent_doc = eg.predicted[start_token:end_token].as_doc()
sentence_docs.append(sent_doc)
set_dropout_rate(self.model, drop)
if not sentence_docs:
warnings.warn(Warnings.W093.format(name="Entity Linker"))
return losses
sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
loss, d_scores = self.get_loss(
sentence_encodings=sentence_encodings, examples=examples
)
bp_context(d_scores)
if sgd is not None:
self.finish_update(sgd)
losses[self.name] += loss
return losses
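
The sentence window used in `update` above clips to the document on both sides. A minimal sketch, assuming sentences are given as `(start, end)` token offsets; `context_window` is a hypothetical helper name.

```python
def context_window(sentence_bounds, sent_index, n_sents):
    """Return the (start_token, end_token) span covering the sentence at
    `sent_index` plus up to `n_sents` neighbours on each side."""
    # Clip to the first and last sentence of the document.
    start_sentence = max(0, sent_index - n_sents)
    end_sentence = min(len(sentence_bounds) - 1, sent_index + n_sents)
    return sentence_bounds[start_sentence][0], sentence_bounds[end_sentence][1]


bounds = [(0, 5), (5, 9), (9, 14), (14, 20)]
first = context_window(bounds, 0, 1)   # no previous sentence to include
middle = context_window(bounds, 2, 1)  # one neighbour on each side
last = context_window(bounds, 3, 2)    # clipped at the end of the document
```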
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
validate_examples(examples, "EntityLinker_v1.get_loss")
entity_encodings = []
for eg in examples:
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
for ent in eg.reference.ents:
kb_id = kb_ids[ent.start]
if kb_id:
entity_encoding = self.kb.get_vector(kb_id)
entity_encodings.append(entity_encoding)
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
if sentence_encodings.shape != entity_encodings.shape:
err = Errors.E147.format(
method="get_loss", msg="gold entities do not match up"
)
raise RuntimeError(err)
# TODO: fix typing issue here
gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore
loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
loss = loss / len(entity_encodings)
return float(loss), gradients
def predict(self, docs: Iterable[Doc]) -> List[str]:
"""Apply the pipeline's model to a batch of docs, without modifying them.
Returns the KB IDs for each entity in each doc, including NIL if there is
no prediction.
docs (Iterable[Doc]): The documents to predict.
RETURNS (List[str]): The model's prediction for each document.
DOCS: https://spacy.io/api/entitylinker#predict
"""
self.validate_kb()
entity_count = 0
final_kb_ids: List[str] = []
if not docs:
return final_kb_ids
if isinstance(docs, Doc):
docs = [docs]
for i, doc in enumerate(docs):
sentences = [s for s in doc.sents]
if len(doc) > 0:
# Looping through each entity (TODO: rewrite)
for ent in doc.ents:
sent = ent.sent
sent_index = sentences.index(sent)
assert sent_index >= 0
# get n_neighbour sentences, clipped to the length of the document
start_sentence = max(0, sent_index - self.n_sents)
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
start_token = sentences[start_sentence].start
end_token = sentences[end_sentence].end
sent_doc = doc[start_token:end_token].as_doc()
# currently, the context is the same for each entity in a sentence (should be refined)
xp = self.model.ops.xp
if self.incl_context:
sentence_encoding = self.model.predict([sent_doc])[0]
sentence_encoding_t = sentence_encoding.T
sentence_norm = xp.linalg.norm(sentence_encoding_t)
entity_count += 1
if ent.label_ in self.labels_discard:
# ignoring this entity - setting to NIL
final_kb_ids.append(self.NIL)
else:
candidates = list(self.get_candidates(self.kb, ent))
if not candidates:
# no prediction possible for this entity - setting to NIL
final_kb_ids.append(self.NIL)
elif len(candidates) == 1:
# shortcut for efficiency reasons: take the 1 candidate
# TODO: thresholding
final_kb_ids.append(candidates[0].entity_)
else:
random.shuffle(candidates)
# set all prior probabilities to 0 if incl_prior=False
prior_probs = xp.asarray([c.prior_prob for c in candidates])
if not self.incl_prior:
prior_probs = xp.asarray([0.0 for _ in candidates])
scores = prior_probs
# add in similarity from the context
if self.incl_context:
entity_encodings = xp.asarray(
[c.entity_vector for c in candidates]
)
entity_norm = xp.linalg.norm(entity_encodings, axis=1)
if len(entity_encodings) != len(prior_probs):
raise RuntimeError(
Errors.E147.format(
method="predict",
msg="vectors not of equal length",
)
)
# cosine similarity
sims = xp.dot(entity_encodings, sentence_encoding_t) / (
sentence_norm * entity_norm
)
if sims.shape != prior_probs.shape:
raise ValueError(Errors.E161)
scores = prior_probs + sims - (prior_probs * sims)
# TODO: thresholding
best_index = scores.argmax().item()
best_candidate = candidates[best_index]
final_kb_ids.append(best_candidate.entity_)
if not (len(final_kb_ids) == entity_count):
err = Errors.E147.format(
method="predict", msg="result variables not of equal length"
)
raise RuntimeError(err)
return final_kb_ids
def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
"""Modify a batch of documents, using pre-computed scores.
docs (Iterable[Doc]): The documents to modify.
kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict.
DOCS: https://spacy.io/api/entitylinker#set_annotations
"""
count_ents = len([ent for doc in docs for ent in doc.ents])
if count_ents != len(kb_ids):
raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
i = 0
overwrite = self.cfg["overwrite"]
for doc in docs:
for ent in doc.ents:
kb_id = kb_ids[i]
i += 1
for token in ent:
if token.ent_kb_id == 0 or overwrite:
token.ent_kb_id_ = kb_id
def to_bytes(self, *, exclude=tuple()):
"""Serialize the pipe to a bytestring.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (bytes): The serialized object.
DOCS: https://spacy.io/api/entitylinker#to_bytes
"""
self._validate_serialization_attrs()
serialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
serialize["kb"] = self.kb.to_bytes
serialize["model"] = self.model.to_bytes
return util.to_bytes(serialize, exclude)
def from_bytes(self, bytes_data, *, exclude=tuple()):
"""Load the pipe from a bytestring.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (TrainablePipe): The loaded object.
DOCS: https://spacy.io/api/entitylinker#from_bytes
"""
self._validate_serialization_attrs()
def load_model(b):
try:
self.model.from_bytes(b)
except AttributeError:
raise ValueError(Errors.E149) from None
deserialize = {}
if hasattr(self, "cfg") and self.cfg is not None:
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
deserialize["model"] = load_model
util.from_bytes(bytes_data, deserialize, exclude)
return self
def to_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> None:
"""Serialize the pipe to disk.
path (str / Path): Path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.
DOCS: https://spacy.io/api/entitylinker#to_disk
"""
serialize = {}
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
serialize["kb"] = lambda p: self.kb.to_disk(p)
serialize["model"] = lambda p: self.model.to_disk(p)
util.to_disk(path, serialize, exclude)
def from_disk(
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
) -> "EntityLinker_v1":
"""Load the pipe from disk. Modifies the object in place and returns it.
path (str / Path): Path to a directory.
exclude (Iterable[str]): String names of serialization fields to exclude.
RETURNS (EntityLinker_v1): The modified EntityLinker_v1 object.
DOCS: https://spacy.io/api/entitylinker#from_disk
"""
def load_model(p):
try:
with p.open("rb") as infile:
self.model.from_bytes(infile.read())
except AttributeError:
raise ValueError(Errors.E149) from None
deserialize: Dict[str, Callable[[Any], Any]] = {}
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
deserialize["kb"] = lambda p: self.kb.from_disk(p)
deserialize["model"] = load_model
util.from_disk(path, deserialize, exclude)
return self
def rehearse(self, examples, *, sgd=None, losses=None, **config):
raise NotImplementedError
def add_label(self, label):
raise NotImplementedError
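
The scoring rule in `predict` above, `prior_probs + sims - (prior_probs * sims)`, can be tried out in isolation. A pure-Python sketch with hypothetical names (`cosine`, `combine`, `best_candidate`) and made-up candidate tuples of `(kb_id, prior, vector)`:

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def combine(prior, sim):
    # Same form as the score in predict: p + s - p * s.
    return prior + sim - prior * sim


def best_candidate(candidates, context):
    """Pick the KB ID whose combined score is highest."""
    best = max(candidates, key=lambda c: combine(c[1], cosine(c[2], context)))
    return best[0]


# Illustrative values only: Q1 has the higher prior, Q2 matches the context.
candidates = [("Q1", 0.6, [0.0, 1.0]), ("Q2", 0.3, [1.0, 0.0])]
winner = best_candidate(candidates, context=[1.0, 0.0])
```

With a prior and a similarity both in [0, 1], the combined score also stays in [0, 1] and is at least as large as either input.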


@ -25,7 +25,7 @@ BACKWARD_EXTEND = False
default_model_config = """ default_model_config = """
[model] [model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.Tok2Vec.v2" @architectures = "spacy.Tok2Vec.v2"


@ -20,7 +20,7 @@ BACKWARD_OVERWRITE = False
default_model_config = """ default_model_config = """
[model] [model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2" @architectures = "spacy.HashEmbedCNN.v2"


@ -272,6 +272,24 @@ class SpanCategorizer(TrainablePipe):
scores = self.model.predict((docs, indices)) # type: ignore scores = self.model.predict((docs, indices)) # type: ignore
return indices, scores return indices, scores
def set_candidates(
self, docs: Iterable[Doc], *, candidates_key: str = "candidates"
) -> None:
"""Use the spancat suggester to add a list of span candidates to a list of docs.
This method is intended to be used for debugging purposes.
docs (Iterable[Doc]): The documents to modify.
candidates_key (str): Key of the Doc.spans dict to save the candidate spans under.
DOCS: https://spacy.io/api/spancategorizer#set_candidates
"""
suggester_output = self.suggester(docs, ops=self.model.ops)
for candidates, doc in zip(suggester_output, docs): # type: ignore
doc.spans[candidates_key] = []
for index in candidates.dataXd:
doc.spans[candidates_key].append(doc[index[0] : index[1]])
def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None: def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None:
"""Modify a batch of Doc objects, using pre-computed scores. """Modify a batch of Doc objects, using pre-computed scores.
@ -378,7 +396,7 @@ class SpanCategorizer(TrainablePipe):
# If the prediction is 0.9 and it's false, the gradient will be # If the prediction is 0.9 and it's false, the gradient will be
# 0.9 (0.9 - 0.0) # 0.9 (0.9 - 0.0)
d_scores = scores - target d_scores = scores - target
loss = float((d_scores ** 2).sum()) loss = float((d_scores**2).sum())
return loss, d_scores return loss, d_scores
def initialize( def initialize(
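Note on `set_candidates`: the new method stores one span per `(start, end)` row produced by the suggester. The following is a minimal sketch of that bookkeeping on plain Python lists, assuming an n-gram style suggester. `ngram_index_pairs` and `collect_candidates` are hypothetical names used only for illustration and are not part of spaCy.

```python
from typing import List, Sequence, Tuple


def ngram_index_pairs(length: int, sizes: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (start, end) pairs for every n-gram of the given sizes."""
    pairs = []
    for size in sizes:
        for start in range(0, length - size + 1):
            pairs.append((start, start + size))
    return pairs


def collect_candidates(
    tokens: Sequence[str], pairs: Sequence[Tuple[int, int]]
) -> List[List[str]]:
    """Slice the token list once per (start, end) pair, like doc[start:end]."""
    return [list(tokens[start:end]) for start, end in pairs]


tokens = ["green", "eggs", "ham"]
pairs = ngram_index_pairs(len(tokens), sizes=[1, 2])
candidates = collect_candidates(tokens, pairs)
```

Storing the candidates under their own key keeps them separate from the predicted spans, which matches the debugging purpose stated in the docstring.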

View File

@ -27,7 +27,7 @@ BACKWARD_OVERWRITE = False
default_model_config = """ default_model_config = """
[model] [model]
@architectures = "spacy.Tagger.v1" @architectures = "spacy.Tagger.v2"
[model.tok2vec] [model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2" @architectures = "spacy.HashEmbedCNN.v2"
@ -225,6 +225,7 @@ class Tagger(TrainablePipe):
DOCS: https://spacy.io/api/tagger#rehearse DOCS: https://spacy.io/api/tagger#rehearse
""" """
loss_func = SequenceCategoricalCrossentropy()
if losses is None: if losses is None:
losses = {} losses = {}
losses.setdefault(self.name, 0.0) losses.setdefault(self.name, 0.0)
@ -236,12 +237,12 @@ class Tagger(TrainablePipe):
# Handle cases where there are no tokens in any docs. # Handle cases where there are no tokens in any docs.
return losses return losses
set_dropout_rate(self.model, drop) set_dropout_rate(self.model, drop)
guesses, backprop = self.model.begin_update(docs) tag_scores, bp_tag_scores = self.model.begin_update(docs)
target = self._rehearsal_model(examples) tutor_tag_scores, _ = self._rehearsal_model.begin_update(docs)
gradient = guesses - target grads, loss = loss_func(tag_scores, tutor_tag_scores)
backprop(gradient) bp_tag_scores(grads)
self.finish_update(sgd) self.finish_update(sgd)
losses[self.name] += (gradient**2).sum() losses[self.name] += loss
return losses return losses
def get_loss(self, examples, scores): def get_loss(self, examples, scores):

View File

@ -283,12 +283,12 @@ class TextCategorizer(TrainablePipe):
return losses return losses
set_dropout_rate(self.model, drop) set_dropout_rate(self.model, drop)
scores, bp_scores = self.model.begin_update(docs) scores, bp_scores = self.model.begin_update(docs)
target = self._rehearsal_model(examples) target, _ = self._rehearsal_model.begin_update(docs)
gradient = scores - target gradient = scores - target
bp_scores(gradient) bp_scores(gradient)
if sgd is not None: if sgd is not None:
self.finish_update(sgd) self.finish_update(sgd)
losses[self.name] += (gradient ** 2).sum() losses[self.name] += (gradient**2).sum()
return losses return losses
def _examples_to_truth( def _examples_to_truth(
@ -320,9 +320,9 @@ class TextCategorizer(TrainablePipe):
self._validate_categories(examples) self._validate_categories(examples)
truths, not_missing = self._examples_to_truth(examples) truths, not_missing = self._examples_to_truth(examples)
not_missing = self.model.ops.asarray(not_missing) # type: ignore not_missing = self.model.ops.asarray(not_missing) # type: ignore
d_scores = (scores - truths) d_scores = scores - truths
d_scores *= not_missing d_scores *= not_missing
mean_square_error = (d_scores ** 2).mean() mean_square_error = (d_scores**2).mean()
return float(mean_square_error), d_scores return float(mean_square_error), d_scores
def add_label(self, label: str) -> int: def add_label(self, label: str) -> int:
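Note on `get_loss`: the hunk above computes a squared-error loss in which labels without gold annotation are masked out before averaging. The following is a plain-Python sketch of that arithmetic, with nested lists standing in for the arrays. `masked_mse` is a hypothetical helper, not a spaCy function.

```python
from typing import List, Tuple


def masked_mse(
    scores: List[List[float]],
    truths: List[List[float]],
    not_missing: List[List[float]],
) -> Tuple[float, List[List[float]]]:
    """Return (mean squared error, gradient) with missing entries zeroed out."""
    d_scores = [
        [(s - t) * m for s, t, m in zip(s_row, t_row, m_row)]
        for s_row, t_row, m_row in zip(scores, truths, not_missing)
    ]
    flat = [d for row in d_scores for d in row]
    mean_square_error = sum(d**2 for d in flat) / len(flat)
    return mean_square_error, d_scores


scores = [[0.9, 0.2], [0.4, 0.5]]
truths = [[1.0, 0.0], [0.0, 0.0]]
not_missing = [[1.0, 1.0], [1.0, 0.0]]  # the last label has no gold annotation
loss, grad = masked_mse(scores, truths, not_missing)
```

Masked entries still count in the denominator of the mean, as they do in the `(d_scores**2).mean()` call in the diff.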

View File

@ -118,6 +118,10 @@ class Tok2Vec(TrainablePipe):
DOCS: https://spacy.io/api/tok2vec#predict DOCS: https://spacy.io/api/tok2vec#predict
""" """
if not any(len(doc) for doc in docs):
# Handle cases where there are no tokens in any docs.
width = self.model.get_dim("nO")
return [self.model.ops.alloc((0, width)) for doc in docs]
tokvecs = self.model.predict(docs) tokvecs = self.model.predict(docs)
batch_id = Tok2VecListener.get_batch_id(docs) batch_id = Tok2VecListener.get_batch_id(docs)
for listener in self.listeners: for listener in self.listeners:

View File

@ -228,7 +228,7 @@ class Scorer:
if token.orth_.isspace(): if token.orth_.isspace():
continue continue
if align.x2y.lengths[token.i] == 1: if align.x2y.lengths[token.i] == 1:
gold_i = align.x2y[token.i].dataXd[0, 0] gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices: if gold_i not in missing_indices:
pred_tags.add((gold_i, getter(token, attr))) pred_tags.add((gold_i, getter(token, attr)))
tag_score.score_set(pred_tags, gold_tags) tag_score.score_set(pred_tags, gold_tags)
@ -287,7 +287,7 @@ class Scorer:
if token.orth_.isspace(): if token.orth_.isspace():
continue continue
if align.x2y.lengths[token.i] == 1: if align.x2y.lengths[token.i] == 1:
gold_i = align.x2y[token.i].dataXd[0, 0] gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices: if gold_i not in missing_indices:
value = getter(token, attr) value = getter(token, attr)
morph = gold_doc.vocab.strings[value] morph = gold_doc.vocab.strings[value]
@ -694,13 +694,13 @@ class Scorer:
if align.x2y.lengths[token.i] != 1: if align.x2y.lengths[token.i] != 1:
gold_i = None # type: ignore gold_i = None # type: ignore
else: else:
gold_i = align.x2y[token.i].dataXd[0, 0] gold_i = align.x2y[token.i][0]
if gold_i not in missing_indices: if gold_i not in missing_indices:
dep = getter(token, attr) dep = getter(token, attr)
head = head_getter(token, head_attr) head = head_getter(token, head_attr)
if dep not in ignore_labels and token.orth_.strip(): if dep not in ignore_labels and token.orth_.strip():
if align.x2y.lengths[head.i] == 1: if align.x2y.lengths[head.i] == 1:
gold_head = align.x2y[head.i].dataXd[0, 0] gold_head = align.x2y[head.i][0]
else: else:
gold_head = None gold_head = None
# None is indistinct, so we can't just add it to the set # None is indistinct, so we can't just add it to the set
@ -750,7 +750,7 @@ def get_ner_prf(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
for pred_ent in eg.x.ents: for pred_ent in eg.x.ents:
if pred_ent.label_ not in score_per_type: if pred_ent.label_ not in score_per_type:
score_per_type[pred_ent.label_] = PRFScore() score_per_type[pred_ent.label_] = PRFScore()
indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel() indices = align_x2y[pred_ent.start : pred_ent.end]
if len(indices): if len(indices):
g_span = eg.y[indices[0] : indices[-1] + 1] g_span = eg.y[indices[0] : indices[-1] + 1]
# Check we aren't missing annotation on this span. If so, # Check we aren't missing annotation on this span. If so,
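Note on the alignment indexing: the scorer hunks replace `.dataXd[0, 0]` with plain `[0]`, which suggests that indexing the alignment now returns the aligned indices directly. The following sketch shows a ragged index stored as flat data plus per-row lengths. `RaggedIndex` is a hypothetical class for illustration and makes no claim about how spaCy implements its alignment type.

```python
from itertools import accumulate
from typing import List, Union


class RaggedIndex:
    """Flat list of aligned indices plus one length per source token."""

    def __init__(self, rows: List[List[int]]) -> None:
        self.data = [i for row in rows for i in row]
        self.lengths = [len(row) for row in rows]
        # _starts[i] is the offset of row i in self.data
        self._starts = [0] + list(accumulate(self.lengths))

    def __getitem__(self, key: Union[int, slice]) -> List[int]:
        # Only non-negative integer keys and simple slices are handled here.
        if isinstance(key, int):
            return self.data[self._starts[key] : self._starts[key + 1]]
        start, stop, _ = key.indices(len(self.lengths))
        return self.data[self._starts[start] : self._starts[stop]]


# Predicted tokens 0 and 1 both map to gold token 0; token 2 maps to gold 1 and 2.
x2y = RaggedIndex([[0], [0], [1, 2]])
gold_i = x2y[0][0] if x2y.lengths[0] == 1 else None
```

Slicing a token range returns the concatenated gold indices, which is the shape the `indices[0]` and `indices[-1]` lookups in `get_ner_prf` rely on.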

View File

@ -1,4 +1,4 @@
from typing import Optional, Iterable, Iterator, Union, Any from typing import Optional, Iterable, Iterator, Union, Any, overload
from pathlib import Path from pathlib import Path
def get_string_id(key: Union[str, int]) -> int: ... def get_string_id(key: Union[str, int]) -> int: ...
@ -7,7 +7,10 @@ class StringStore:
def __init__( def __init__(
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ... self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
) -> None: ... ) -> None: ...
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ... @overload
def __getitem__(self, string_or_id: Union[bytes, str]) -> int: ...
@overload
def __getitem__(self, string_or_id: int) -> str: ...
def as_int(self, key: Union[bytes, str, int]) -> int: ... def as_int(self, key: Union[bytes, str, int]) -> int: ...
def as_string(self, key: Union[bytes, str, int]) -> str: ... def as_string(self, key: Union[bytes, str, int]) -> str: ...
def add(self, string: str) -> int: ... def add(self, string: str) -> int: ...
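Note on the `@overload` stubs: splitting `__getitem__` into two overloads lets a type checker infer `int` for string keys and `str` for integer keys instead of a `Union`. The following is a self-contained sketch of the same pattern on a toy class. `TinyStore` is hypothetical and uses sequential IDs, unlike spaCy's hash-based `StringStore`.

```python
from typing import Dict, List, Union, overload


class TinyStore:
    """Toy two-way string/ID table; not spaCy's StringStore."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def add(self, string: str) -> int:
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    @overload
    def __getitem__(self, string_or_id: str) -> int: ...
    @overload
    def __getitem__(self, string_or_id: int) -> str: ...
    def __getitem__(self, string_or_id: Union[str, int]) -> Union[str, int]:
        # The overloads above only affect static typing; this body runs for both.
        if isinstance(string_or_id, int):
            return self._strings[string_or_id]
        return self._ids[string_or_id]


store = TinyStore()
key = store.add("coffee")
```

With the overloads in place, `store["coffee"]` type-checks as `int` and `store[key]` as `str`, so callers no longer need casts or `# type: ignore` comments.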

View File

@ -99,6 +99,11 @@ def de_vocab():
return get_lang_class("de")().vocab return get_lang_class("de")().vocab
@pytest.fixture(scope="session")
def dsb_tokenizer():
return get_lang_class("dsb")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def el_tokenizer(): def el_tokenizer():
return get_lang_class("el")().tokenizer return get_lang_class("el")().tokenizer
@ -221,12 +226,30 @@ def ja_tokenizer():
return get_lang_class("ja")().tokenizer return get_lang_class("ja")().tokenizer
@pytest.fixture(scope="session")
def hsb_tokenizer():
return get_lang_class("hsb")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def ko_tokenizer(): def ko_tokenizer():
pytest.importorskip("natto") pytest.importorskip("natto")
return get_lang_class("ko")().tokenizer return get_lang_class("ko")().tokenizer
@pytest.fixture(scope="session")
def ko_tokenizer_tokenizer():
config = {
"nlp": {
"tokenizer": {
"@tokenizers": "spacy.Tokenizer.v1",
}
}
}
nlp = get_lang_class("ko").from_config(config)
return nlp.tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def lb_tokenizer(): def lb_tokenizer():
return get_lang_class("lb")().tokenizer return get_lang_class("lb")().tokenizer
@ -334,6 +357,11 @@ def sv_tokenizer():
return get_lang_class("sv")().tokenizer return get_lang_class("sv")().tokenizer
@pytest.fixture(scope="session")
def ta_tokenizer():
return get_lang_class("ta")().tokenizer
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def th_tokenizer(): def th_tokenizer():
pytest.importorskip("pythainlp") pytest.importorskip("pythainlp")

View File

@ -1,6 +1,7 @@
import weakref import weakref
import numpy import numpy
from numpy.testing import assert_array_equal
import pytest import pytest
from thinc.api import NumpyOps, get_current_ops from thinc.api import NumpyOps, get_current_ops
@ -634,6 +635,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans assert "group" in m_doc.spans
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]]) assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
# can exclude spans
m_doc = Doc.from_docs(en_docs, exclude=["spans"])
assert "group" not in m_doc.spans
# can exclude user_data
m_doc = Doc.from_docs(en_docs, exclude=["user_data"])
assert m_doc.user_data == {}
# can merge empty docs # can merge empty docs
doc = Doc.from_docs([en_tokenizer("")] * 10) doc = Doc.from_docs([en_tokenizer("")] * 10)
@ -647,6 +656,20 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
assert "group" in m_doc.spans assert "group" in m_doc.spans
assert len(m_doc.spans["group"]) == 0 assert len(m_doc.spans["group"]) == 0
# with tensor
ops = get_current_ops()
for doc in en_docs:
doc.tensor = ops.asarray([[len(t.text), 0.0] for t in doc])
m_doc = Doc.from_docs(en_docs)
assert_array_equal(
ops.to_numpy(m_doc.tensor),
ops.to_numpy(ops.xp.vstack([doc.tensor for doc in en_docs if len(doc)])),
)
# can exclude tensor
m_doc = Doc.from_docs(en_docs, exclude=["tensor"])
assert m_doc.tensor.shape == (0,)
def test_doc_api_from_docs_ents(en_tokenizer): def test_doc_api_from_docs_ents(en_tokenizer):
texts = ["Merging the docs is fun.", "They don't think alike."] texts = ["Merging the docs is fun.", "They don't think alike."]
@ -684,6 +707,7 @@ def test_has_annotation(en_vocab):
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE") attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
for attr in attrs: for attr in attrs:
assert not doc.has_annotation(attr) assert not doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
doc[0].tag_ = "A" doc[0].tag_ = "A"
doc[0].pos_ = "X" doc[0].pos_ = "X"
@ -709,6 +733,27 @@ def test_has_annotation(en_vocab):
assert doc.has_annotation(attr, require_complete=True) assert doc.has_annotation(attr, require_complete=True)
def test_has_annotation_sents(en_vocab):
doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
for attr in attrs:
assert not doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
# The first token (index 0) is always assumed to be a sentence start,
# and ignored by the check in doc.has_annotation
doc[1].is_sent_start = False
for attr in attrs:
assert doc.has_annotation(attr)
assert not doc.has_annotation(attr, require_complete=True)
doc[2].is_sent_start = False
for attr in attrs:
assert doc.has_annotation(attr)
assert doc.has_annotation(attr, require_complete=True)
def test_is_flags_deprecated(en_tokenizer): def test_is_flags_deprecated(en_tokenizer):
doc = en_tokenizer("test") doc = en_tokenizer("test")
with pytest.deprecated_call(): with pytest.deprecated_call():
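Note on `test_has_annotation_sents`: the assertions describe a check that skips the first token and then asks whether any (or, with `require_complete=True`, all) remaining tokens carry a sentence-start value. The following sketch captures that rule on a plain list, where `None` marks an unset value. `has_sent_annotation` is a hypothetical helper, and docs shorter than two tokens are deliberately left out.

```python
from typing import Optional, Sequence


def has_sent_annotation(
    sent_starts: Sequence[Optional[bool]], require_complete: bool = False
) -> bool:
    """Check sentence-boundary annotation, skipping the first token."""
    if len(sent_starts) < 2:
        raise ValueError("this sketch only covers docs with two or more tokens")
    rest = sent_starts[1:]  # token 0 is always treated as a sentence start
    flags = [value is not None for value in rest]
    return all(flags) if require_complete else any(flags)


unannotated = [None, None, None]
partial = [None, False, None]
complete = [None, False, False]
```

The three lists mirror the three stages of the test: no values set, one value set, and all non-initial values set.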

View File

@ -655,3 +655,16 @@ def test_span_sents(doc, start, end, expected_sentences, expected_sentences_with
def test_span_sents_not_parsed(doc_not_parsed): def test_span_sents_not_parsed(doc_not_parsed):
with pytest.raises(ValueError): with pytest.raises(ValueError):
list(Span(doc_not_parsed, 0, 3).sents) list(Span(doc_not_parsed, 0, 3).sents)
def test_span_group_copy(doc):
doc.spans["test"] = [doc[0:1], doc[2:4]]
assert len(doc.spans["test"]) == 2
doc_copy = doc.copy()
# check that the spans were indeed copied
assert len(doc_copy.spans["test"]) == 2
# add a new span to the original doc
doc.spans["test"].append(doc[3:4])
assert len(doc.spans["test"]) == 3
# check that the copy spans were not modified and this is an isolated doc
assert len(doc_copy.spans["test"]) == 2

View File

@ -0,0 +1,242 @@
import pytest
from random import Random
from spacy.matcher import Matcher
from spacy.tokens import Span, SpanGroup
@pytest.fixture
def doc(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
return doc
@pytest.fixture
def other_doc(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
return doc
@pytest.fixture
def span_group(en_tokenizer):
doc = en_tokenizer("0 1 2 3 4 5 6")
matcher = Matcher(en_tokenizer.vocab, validate=True)
# fmt: off
matcher.add("4", [[{}, {}, {}, {}]])
matcher.add("2", [[{}, {}, ]])
matcher.add("1", [[{}, ]])
# fmt: on
matches = matcher(doc)
spans = []
for match in matches:
spans.append(
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
)
Random(42).shuffle(spans)
doc.spans["SPANS"] = SpanGroup(
doc, name="SPANS", attrs={"key": "value"}, spans=spans
)
def test_span_group_copy(doc):
span_group = doc.spans["SPANS"]
clone = span_group.copy()
assert clone != span_group
assert clone.name == span_group.name
assert clone.attrs == span_group.attrs
assert len(clone) == len(span_group)
assert list(span_group) == list(clone)
clone.name = "new_name"
clone.attrs["key"] = "new_value"
clone.append(Span(doc, 0, 6, "LABEL"))
assert clone.name != span_group.name
assert clone.attrs != span_group.attrs
assert span_group.attrs["key"] == "value"
assert list(span_group) != list(clone)
def test_span_group_set_item(doc, other_doc):
span_group = doc.spans["SPANS"]
index = 5
span = span_group[index]
span.label_ = "NEW LABEL"
span.kb_id = doc.vocab.strings["KB_ID"]
assert span_group[index].label != span.label
assert span_group[index].kb_id != span.kb_id
span_group[index] = span
assert span_group[index].start == span.start
assert span_group[index].end == span.end
assert span_group[index].label == span.label
assert span_group[index].kb_id == span.kb_id
assert span_group[index] == span
with pytest.raises(IndexError):
span_group[-100] = span
with pytest.raises(IndexError):
span_group[100] = span
span = Span(other_doc, 0, 2)
with pytest.raises(ValueError):
span_group[index] = span
def test_span_group_has_overlap(doc):
span_group = doc.spans["SPANS"]
assert span_group.has_overlap
def test_span_group_concat(doc, other_doc):
span_group_1 = doc.spans["SPANS"]
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_3 = span_group_1._concat(span_group_2)
assert span_group_3.name == span_group_1.name
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
span_list_expected = list(span_group_1) + list(span_group_2)
assert list(span_group_3) == list(span_list_expected)
# Inplace
span_list_expected = list(span_group_1) + list(span_group_2)
span_group_3 = span_group_1._concat(span_group_2, inplace=True)
assert span_group_3 == span_group_1
assert span_group_3.name == span_group_1.name
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_3) == list(span_list_expected)
span_group_2 = other_doc.spans["SPANS"]
with pytest.raises(ValueError):
span_group_1._concat(span_group_2)
def test_span_doc_delitem(doc):
span_group = doc.spans["SPANS"]
length = len(span_group)
index = 5
span = span_group[index]
next_span = span_group[index + 1]
del span_group[index]
assert len(span_group) == length - 1
assert span_group[index] != span
assert span_group[index] == next_span
with pytest.raises(IndexError):
del span_group[-100]
with pytest.raises(IndexError):
del span_group[100]
def test_span_group_add(doc):
span_group_1 = doc.spans["SPANS"]
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_3_expected = span_group_1._concat(span_group_2)
span_group_3 = span_group_1 + span_group_2
assert len(span_group_3) == len(span_group_3_expected)
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_3) == list(span_group_3_expected)
def test_span_group_iadd(doc):
span_group_1 = doc.spans["SPANS"].copy()
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_1_expected = span_group_1._concat(span_group_2)
span_group_1 += span_group_2
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_1) == list(span_group_1_expected)
span_group_1 = doc.spans["SPANS"].copy()
span_group_1 += spans
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {
"key": "value",
}
assert list(span_group_1) == list(span_group_1_expected)
def test_span_group_extend(doc):
span_group_1 = doc.spans["SPANS"].copy()
spans = [doc[0:5], doc[0:6]]
span_group_2 = SpanGroup(
doc,
name="MORE_SPANS",
attrs={"key": "new_value", "new_key": "new_value"},
spans=spans,
)
span_group_1_expected = span_group_1._concat(span_group_2)
span_group_1.extend(span_group_2)
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
assert list(span_group_1) == list(span_group_1_expected)
span_group_1 = doc.spans["SPANS"]
span_group_1.extend(spans)
assert len(span_group_1) == len(span_group_1_expected)
assert span_group_1.attrs == {"key": "value"}
assert list(span_group_1) == list(span_group_1_expected)
def test_span_group_dealloc(span_group):
with pytest.raises(AttributeError):
print(span_group.doc)
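Note on the concatenation tests: the expected `attrs` of `{"key": "value", "new_key": "new_value"}` imply that the left-hand group's values win on key collisions, while new keys from the right-hand group are added. The following sketch shows that merge rule. `ToyGroup` is a hypothetical stand-in for illustration, with tuples in place of `Span` objects.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToyGroup:
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)
    spans: List[Any] = field(default_factory=list)

    def concat(self, other: "ToyGroup") -> "ToyGroup":
        # Unpack the right-hand attrs first so that the left-hand values
        # overwrite them wherever both groups define the same key.
        merged_attrs = {**other.attrs, **self.attrs}
        return ToyGroup(self.name, merged_attrs, self.spans + other.spans)


first = ToyGroup("SPANS", {"key": "value"}, [(0, 1), (2, 4)])
second = ToyGroup(
    "MORE_SPANS", {"key": "new_value", "new_key": "new_value"}, [(0, 5)]
)
combined = first.concat(second)
```

The result keeps the left-hand name and leaves both inputs unchanged, which corresponds to the non-inplace branch exercised in `test_span_group_concat`.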

View File

@ -1,5 +1,5 @@
import pytest import pytest
from spacy.tokens import Doc from spacy.tokens import Doc, Span
@pytest.fixture() @pytest.fixture()
@ -60,3 +60,13 @@ def test_doc_to_json_underscore_error_serialize(doc):
Doc.set_extension("json_test4", method=lambda doc: doc.text) Doc.set_extension("json_test4", method=lambda doc: doc.text)
with pytest.raises(ValueError): with pytest.raises(ValueError):
doc.to_json(underscore=["json_test4"]) doc.to_json(underscore=["json_test4"])
def test_doc_to_json_span(doc):
"""Test that Doc.to_json() includes spans"""
doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")]
json_doc = doc.to_json()
assert "spans" in json_doc
assert len(json_doc["spans"]) == 1
assert len(json_doc["spans"]["test"]) == 2
assert json_doc["spans"]["test"][0]["start"] == 0

View File

@ -0,0 +1,25 @@
import pytest


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10,000", True),
        ("10,00", True),
        ("jadno", True),
        ("dwanassćo", True),
        ("milion", True),
        ("sto", True),
        ("ceła", False),
        ("kopica", False),
        ("narěcow", False),
        (",", False),
        ("1/2", True),
    ],
)
def test_lex_attrs_like_number(dsb_tokenizer, text, match):
    tokens = dsb_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match
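Note on `like_num`: the parametrized cases above cover digit strings with separators, simple fractions, and number words. The following sketch satisfies them, written in the style commonly used for spaCy's language-specific `like_num` helpers. Whether the Lower Sorbian implementation matches it line for line is an assumption, and the word set holds only the words from the test.

```python
# Only the number words exercised by the test; a real list would be much longer.
_NUM_WORDS = {"jadno", "dwanassćo", "milion", "sto"}


def like_num(text: str) -> bool:
    """Return True for digit strings, simple fractions and known number words."""
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _NUM_WORDS


cases = {
    "10": True,
    "10,000": True,
    "10,00": True,
    "jadno": True,
    "sto": True,
    "ceła": False,
    ",": False,
    "1/2": True,
}
results = {text: like_num(text) for text in cases}
```

A lone comma is reduced to the empty string by the separator stripping, which is why it is rejected rather than treated as a number.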

View File

@ -0,0 +1,29 @@
import pytest

DSB_BASIC_TOKENIZATION_TESTS = [
    (
        "Ale eksistěrujo mimo togo ceła kopica narěcow, ako na pśikład slěpjańska.",
        [
            "Ale",
            "eksistěrujo",
            "mimo",
            "togo",
            "ceła",
            "kopica",
            "narěcow",
            ",",
            "ako",
            "na",
            "pśikład",
            "slěpjańska",
            ".",
        ],
    ),
]


@pytest.mark.parametrize("text,expected_tokens", DSB_BASIC_TOKENIZATION_TESTS)
def test_dsb_tokenizer_basic(dsb_tokenizer, text, expected_tokens):
    tokens = dsb_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list

View File

@ -107,7 +107,17 @@ FI_NP_TEST_EXAMPLES = [
( (
"New York tunnetaan kaupunkina, joka ei koskaan nuku", "New York tunnetaan kaupunkina, joka ei koskaan nuku",
["PROPN", "PROPN", "VERB", "NOUN", "PUNCT", "PRON", "AUX", "ADV", "VERB"], ["PROPN", "PROPN", "VERB", "NOUN", "PUNCT", "PRON", "AUX", "ADV", "VERB"],
["obj", "flat:name", "ROOT", "obl", "punct", "nsubj", "aux", "advmod", "acl:relcl"], [
"obj",
"flat:name",
"ROOT",
"obl",
"punct",
"nsubj",
"aux",
"advmod",
"acl:relcl",
],
[2, -1, 0, -1, 4, 3, 2, 1, -5], [2, -1, 0, -1, 4, 3, 2, 1, -5],
["New York", "kaupunkina"], ["New York", "kaupunkina"],
), ),
@ -130,7 +140,12 @@ FI_NP_TEST_EXAMPLES = [
["NOUN", "VERB", "NOUN", "NOUN", "ADJ", "NOUN"], ["NOUN", "VERB", "NOUN", "NOUN", "ADJ", "NOUN"],
["nsubj", "ROOT", "obj", "obl", "amod", "obl"], ["nsubj", "ROOT", "obj", "obl", "amod", "obl"],
[1, 0, -1, -1, 1, -3], [1, 0, -1, -1, 1, -3],
["sairaanhoitopiirit", "leikkaustoimintaa", "alueellaan", "useammassa sairaalassa"], [
"sairaanhoitopiirit",
"leikkaustoimintaa",
"alueellaan",
"useammassa sairaalassa",
],
), ),
( (
"Lain mukaan varhaiskasvatus on suunnitelmallista toimintaa", "Lain mukaan varhaiskasvatus on suunnitelmallista toimintaa",

View File

@ -0,0 +1,25 @@
import pytest


@pytest.mark.parametrize(
    "text,match",
    [
        ("10", True),
        ("1", True),
        ("10,000", True),
        ("10,00", True),
        ("jedne", True),
        ("dwanaće", True),
        ("milion", True),
        ("sto", True),
        ("załožene", False),
        ("wona", False),
        ("powšitkownej", False),
        (",", False),
        ("1/2", True),
    ],
)
def test_lex_attrs_like_number(hsb_tokenizer, text, match):
    tokens = hsb_tokenizer(text)
    assert len(tokens) == 1
    assert tokens[0].like_num == match

View File

@ -0,0 +1,32 @@
import pytest

HSB_BASIC_TOKENIZATION_TESTS = [
    (
        "Hornjoserbšćina wobsteji resp. wobsteješe z wjacorych dialektow, kotrež so zdźěla chětro wot so rozeznawachu.",
        [
            "Hornjoserbšćina",
            "wobsteji",
            "resp.",
            "wobsteješe",
            "z",
            "wjacorych",
            "dialektow",
            ",",
            "kotrež",
            "so",
            "zdźěla",
            "chětro",
            "wot",
            "so",
            "rozeznawachu",
            ".",
        ],
    ),
]


@pytest.mark.parametrize("text,expected_tokens", HSB_BASIC_TOKENIZATION_TESTS)
def test_hsb_tokenizer_basic(hsb_tokenizer, text, expected_tokens):
    tokens = hsb_tokenizer(text)
    token_list = [token.text for token in tokens if not token.is_space]
    assert expected_tokens == token_list

View File

@ -47,3 +47,29 @@ def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
def test_ko_empty_doc(ko_tokenizer): def test_ko_empty_doc(ko_tokenizer):
tokens = ko_tokenizer("") tokens = ko_tokenizer("")
assert len(tokens) == 0 assert len(tokens) == 0
@pytest.mark.issue(10535)
def test_ko_tokenizer_unknown_tag(ko_tokenizer):
tokens = ko_tokenizer("미닛 리피터")
assert tokens[1].pos_ == "X"
# fmt: off
SPACY_TOKENIZER_TESTS = [
("있다.", "있다 ."),
("''", "''"),
("부 (富) 는", "부 ( 富 ) 는"),
("부(富)는", "부 ( 富 ) 는"),
("1982~1983.", "1982 ~ 1983 ."),
("사과·배·복숭아·수박은 모두 과일이다.", "사과 · 배 · 복숭아 · 수박은 모두 과일이다 ."),
("그렇구나~", "그렇구나~"),
("『9시 반의 당구』,", "『 9시 반의 당구 』 ,"),
]
# fmt: on
@pytest.mark.parametrize("text,expected_tokens", SPACY_TOKENIZER_TESTS)
def test_ko_spacy_tokenizer(ko_tokenizer_tokenizer, text, expected_tokens):
tokens = [token.text for token in ko_tokenizer_tokenizer(text)]
assert tokens == expected_tokens.split()

View File

@ -0,0 +1,25 @@
import pytest
from spacy.lang.ta import Tamil
# Wikipedia excerpt: https://en.wikipedia.org/wiki/Chennai (Tamil Language)
TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT = """சென்னை (Chennai) தமிழ்நாட்டின் தலைநகரமும், இந்தியாவின் நான்காவது பெரிய நகரமும் ஆகும். 1996 ஆம் ஆண்டுக்கு முன்னர் இந்நகரம், மதராசு பட்டினம், மெட்ராஸ் (Madras) மற்றும் சென்னப்பட்டினம் என்றும் அழைக்கப்பட்டு வந்தது. சென்னை, வங்காள விரிகுடாவின் கரையில் அமைந்த துறைமுக நகரங்களுள் ஒன்று. சுமார் 10 மில்லியன் (ஒரு கோடி) மக்கள் வாழும் இந்நகரம், உலகின் 35 பெரிய மாநகரங்களுள் ஒன்று. 17ஆம் நூற்றாண்டில் ஆங்கிலேயர் சென்னையில் கால் பதித்தது முதல், சென்னை நகரம் ஒரு முக்கிய நகரமாக வளர்ந்து வந்திருக்கிறது. சென்னை தென்னிந்தியாவின் வாசலாகக் கருதப்படுகிறது. சென்னை நகரில் உள்ள மெரினா கடற்கரை உலகின் நீளமான கடற்கரைகளுள் ஒன்று. சென்னை கோலிவுட் (Kollywood) என அறியப்படும் தமிழ்த் திரைப்படத் துறையின் தாயகம் ஆகும். பல விளையாட்டு அரங்கங்கள் உள்ள சென்னையில் பல விளையாட்டுப் போட்டிகளும் நடைபெறுகின்றன."""
@pytest.mark.parametrize(
"text, num_tokens",
[(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 23 + 90)], # Punctuation + rest
)
def test_long_text(ta_tokenizer, text, num_tokens):
tokens = ta_tokenizer(text)
assert len(tokens) == num_tokens
@pytest.mark.parametrize(
"text, num_sents", [(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 9)]
)
def test_ta_sentencizer(text, num_sents):
nlp = Tamil()
nlp.add_pipe("sentencizer")
doc = nlp(text)
assert len(list(doc.sents)) == num_sents

View File

@ -0,0 +1,188 @@
import pytest
from spacy.symbols import ORTH
from spacy.lang.ta import Tamil
TA_BASIC_TOKENIZATION_TESTS = [
(
"கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்",
["கிறிஸ்துமஸ்", "மற்றும்", "இனிய", "புத்தாண்டு", "வாழ்த்துக்கள்"],
),
(
"எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது",
["எனக்கு", "என்", "குழந்தைப்", "பருவம்", "நினைவிருக்கிறது"],
),
("உங்கள் பெயர் என்ன?", ["உங்கள்", "பெயர்", "என்ன", "?"]),
(
"ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்",
[
"ஏறத்தாழ",
"இலங்கைத்",
"தமிழரில்",
"மூன்றிலொரு",
"பங்கினர்",
"இலங்கையை",
"விட்டு",
"வெளியேறிப்",
"பிற",
"நாடுகளில்",
"வாழ்கின்றனர்",
],
),
(
"இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ் இலவசமாக வழங்கப்படவுள்ளது.",
[
"இந்த",
"ஃபோனுடன்",
"சுமார்",
"ரூ.2,990",
"மதிப்புள்ள",
"போட்",
"ராக்கர்ஸ்",
"நிறுவனத்தின்",
"ஸ்போர்ட்",
"புளூடூத்",
"ஹெட்போன்ஸ்",
"இலவசமாக",
"வழங்கப்படவுள்ளது",
".",
],
),
(
"மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்",
[
"மட்டக்களப்பில்",
"பல",
"இடங்களில்",
"வீட்டுத்",
"திட்டங்களுக்கு",
"இன்று",
"அடிக்கல்",
"நாட்டல்",
],
),
(
"ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும் விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது",
[
"",
"போன்க்கு",
"முகத்தை",
"வைத்து",
"அன்லாக்",
"செய்யும்",
"முறை",
"மற்றும்",
"விரலால்",
"தொட்டு",
"அன்லாக்",
"செய்யும்",
"முறையை",
"வாட்ஸ்",
"ஆப்",
"நிறுவனம்",
"இதற்கு",
"முன்",
"கண்டுபிடித்தது",
],
),
(
"இது ஒரு வாக்கியம்.",
[
"இது",
"ஒரு",
"வாக்கியம்",
".",
],
),
(
"தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
[
"தன்னாட்சி",
"கார்கள்",
"காப்பீட்டு",
"பொறுப்பை",
"உற்பத்தியாளரிடம்",
"மாற்றுகின்றன",
],
),
(
"நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
[
"நடைபாதை",
"விநியோக",
"ரோபோக்களை",
"தடை",
"செய்வதை",
"சான்",
"பிரான்சிஸ்கோ",
"கருதுகிறது",
],
),
(
"லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
[
"லண்டன்",
"ஐக்கிய",
"இராச்சியத்தில்",
"ஒரு",
"பெரிய",
"நகரம்",
".",
],
),
(
"என்ன வேலை செய்கிறீர்கள்?",
[
"என்ன",
"வேலை",
"செய்கிறீர்கள்",
"?",
],
),
(
"எந்த கல்லூரியில் படிக்கிறாய்?",
[
"எந்த",
"கல்லூரியில்",
"படிக்கிறாய்",
"?",
],
),
]
@pytest.mark.parametrize("text,expected_tokens", TA_BASIC_TOKENIZATION_TESTS)
def test_ta_tokenizer_basic(ta_tokenizer, text, expected_tokens):
tokens = ta_tokenizer(text)
token_list = [token.text for token in tokens]
assert expected_tokens == token_list
@pytest.mark.parametrize(
"text,expected_tokens",
[
(
"ஆப்பிள் நிறுவனம் யு.கே. தொடக்க நிறுவனத்தை ஒரு லட்சம் கோடிக்கு வாங்கப் பார்க்கிறது",
[
"ஆப்பிள்",
"நிறுவனம்",
"யு.கே.",
"தொடக்க",
"நிறுவனத்தை",
"ஒரு",
"லட்சம்",
"கோடிக்கு",
"வாங்கப்",
"பார்க்கிறது",
],
)
],
)
def test_ta_tokenizer_special_case(text, expected_tokens):
# Add a special rule to tokenize the initialism "யு.கே." (U.K., as
# in the country) as a single token.
nlp = Tamil()
nlp.tokenizer.add_special_case("யு.கே.", [{ORTH: "யு.கே."}])
tokens = nlp(text)
token_list = [token.text for token in tokens]
assert expected_tokens == token_list

View File

@ -41,7 +41,7 @@ def test_tr_lex_attrs_like_number_cardinal_ordinal(word):
assert like_num(word) assert like_num(word)
@pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci"]) @pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci", "milyonuncu"])
def test_tr_lex_attrs_capitals(word): def test_tr_lex_attrs_capitals(word):
assert like_num(word) assert like_num(word)
assert like_num(word.upper()) assert like_num(word.upper())

View File

@ -694,5 +694,4 @@ TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens): def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
tokens = tr_tokenizer(text) tokens = tr_tokenizer(text)
token_list = [token.text for token in tokens if not token.is_space] token_list = [token.text for token in tokens if not token.is_space]
print(token_list)
assert expected_tokens == token_list assert expected_tokens == token_list

View File

@ -12,6 +12,7 @@ def test_build_dependencies():
"flake8", "flake8",
"hypothesis", "hypothesis",
"pre-commit", "pre-commit",
"black",
"mypy", "mypy",
"types-dataclasses", "types-dataclasses",
"types-mock", "types-mock",

View File

@ -93,8 +93,8 @@ def test_parser_pseudoprojectivity(en_vocab):
assert nonproj.is_decorated("X") is False assert nonproj.is_decorated("X") is False
nonproj._lift(0, tree) nonproj._lift(0, tree)
assert tree == [2, 2, 2] assert tree == [2, 2, 2]
assert nonproj._get_smallest_nonproj_arc(nonproj_tree) == 7 assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree) == 7
assert nonproj._get_smallest_nonproj_arc(nonproj_tree2) == 10 assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree2) == 10
# fmt: off # fmt: off
proj_heads, deco_labels = nonproj.projectivize(nonproj_tree, labels) proj_heads, deco_labels = nonproj.projectivize(nonproj_tree, labels)
assert proj_heads == [1, 2, 2, 4, 5, 2, 7, 5, 2] assert proj_heads == [1, 2, 2, 4, 5, 2, 7, 5, 2]

View File

@ -0,0 +1,280 @@
import pickle
import pytest
from hypothesis import given
import hypothesis.strategies as st
from spacy import util
from spacy.lang.en import English
from spacy.language import Language
from spacy.pipeline._edit_tree_internals.edit_trees import EditTrees
from spacy.training import Example
from spacy.strings import StringStore
from spacy.util import make_tempdir


TRAIN_DATA = [
    ("She likes green eggs", {"lemmas": ["she", "like", "green", "egg"]}),
    ("Eat blue ham", {"lemmas": ["eat", "blue", "ham"]}),
]

PARTIAL_DATA = [
    # partial annotation
    ("She likes green eggs", {"lemmas": ["", "like", "green", ""]}),
    # misaligned partial annotation
    (
        "He hates green eggs",
        {
            "words": ["He", "hat", "es", "green", "eggs"],
            "lemmas": ["", "hat", "e", "green", ""],
        },
    ),
]


def test_initialize_examples():
    nlp = Language()
    lemmatizer = nlp.add_pipe("trainable_lemmatizer")
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    # you shouldn't really call this more than once, but for testing it should be fine
    nlp.initialize(get_examples=lambda: train_examples)
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=lambda: None)
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=lambda: train_examples[0])
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=lambda: [])
    with pytest.raises(TypeError):
        nlp.initialize(get_examples=train_examples)


def test_initialize_from_labels():
    nlp = Language()
    lemmatizer = nlp.add_pipe("trainable_lemmatizer")
    lemmatizer.min_tree_freq = 1
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    nlp.initialize(get_examples=lambda: train_examples)

    nlp2 = Language()
    lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
    lemmatizer2.initialize(
        get_examples=lambda: train_examples,
        labels=lemmatizer.label_data,
    )
    assert lemmatizer2.tree2label == {1: 0, 3: 1, 4: 2, 6: 3}


def test_no_data():
    # Test that the lemmatizer provides a nice error when there's no tagging data / labels
    TEXTCAT_DATA = [
        ("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
        ("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
    ]
    nlp = English()
    nlp.add_pipe("trainable_lemmatizer")
    nlp.add_pipe("textcat")

    train_examples = []
    for t in TEXTCAT_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    with pytest.raises(ValueError):
        nlp.initialize(get_examples=lambda: train_examples)


def test_incomplete_data():
    # Test that the lemmatizer works with incomplete information
    nlp = English()
    lemmatizer = nlp.add_pipe("trainable_lemmatizer")
    lemmatizer.min_tree_freq = 1
    train_examples = []
    for t in PARTIAL_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["trainable_lemmatizer"] < 0.00001

    # test the trained model
    test_text = "She likes blue eggs"
    doc = nlp(test_text)
    assert doc[1].lemma_ == "like"
    assert doc[2].lemma_ == "blue"


def test_overfitting_IO():
    nlp = English()
    lemmatizer = nlp.add_pipe("trainable_lemmatizer")
    lemmatizer.min_tree_freq = 1
    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    optimizer = nlp.initialize(get_examples=lambda: train_examples)

    for i in range(50):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)
    assert losses["trainable_lemmatizer"] < 0.00001

    test_text = "She likes blue eggs"
    doc = nlp(test_text)
    assert doc[0].lemma_ == "she"
    assert doc[1].lemma_ == "like"
    assert doc[2].lemma_ == "blue"
    assert doc[3].lemma_ == "egg"

    # Check model after a {to,from}_disk roundtrip
    with util.make_tempdir() as tmp_dir:
        nlp.to_disk(tmp_dir)
        nlp2 = util.load_model_from_path(tmp_dir)
        doc2 = nlp2(test_text)
        assert doc2[0].lemma_ == "she"
        assert doc2[1].lemma_ == "like"
        assert doc2[2].lemma_ == "blue"
        assert doc2[3].lemma_ == "egg"

    # Check model after a {to,from}_bytes roundtrip
    nlp_bytes = nlp.to_bytes()
    nlp3 = English()
    nlp3.add_pipe("trainable_lemmatizer")
    nlp3.from_bytes(nlp_bytes)
    doc3 = nlp3(test_text)
    assert doc3[0].lemma_ == "she"
    assert doc3[1].lemma_ == "like"
    assert doc3[2].lemma_ == "blue"
    assert doc3[3].lemma_ == "egg"

    # Check model after a pickle roundtrip.
    nlp_bytes = pickle.dumps(nlp)
    nlp4 = pickle.loads(nlp_bytes)
    doc4 = nlp4(test_text)
    assert doc4[0].lemma_ == "she"
    assert doc4[1].lemma_ == "like"
    assert doc4[2].lemma_ == "blue"
    assert doc4[3].lemma_ == "egg"


def test_lemmatizer_requires_labels():
    nlp = English()
    nlp.add_pipe("trainable_lemmatizer")
    with pytest.raises(ValueError):
        nlp.initialize()


def test_lemmatizer_label_data():
    nlp = English()
    lemmatizer = nlp.add_pipe("trainable_lemmatizer")
    lemmatizer.min_tree_freq = 1

    train_examples = []
    for t in TRAIN_DATA:
        train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))

    nlp.initialize(get_examples=lambda: train_examples)

    nlp2 = English()
    lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
    lemmatizer2.initialize(
        get_examples=lambda: train_examples, labels=lemmatizer.label_data
    )

    # Verify that the labels and trees are the same.
    assert lemmatizer.labels == lemmatizer2.labels
    assert lemmatizer.trees.to_bytes() == lemmatizer2.trees.to_bytes()


def test_dutch():
    strings = StringStore()
    trees = EditTrees(strings)
    tree = trees.add("deelt", "delen")
    assert trees.tree_to_str(tree) == "(m 0 3 () (m 0 2 (s '' 'l') (s 'lt' 'n')))"

    tree = trees.add("gedeeld", "delen")
    assert (
        trees.tree_to_str(tree) == "(m 2 3 (s 'ge' '') (m 0 2 (s '' 'l') (s 'ld' 'n')))"
    )


def test_from_to_bytes():
    strings = StringStore()
    trees = EditTrees(strings)
    trees.add("deelt", "delen")
    trees.add("gedeeld", "delen")

    b = trees.to_bytes()

    trees2 = EditTrees(strings)
    trees2.from_bytes(b)

    # Verify that the nodes did not change.
    assert len(trees) == len(trees2)
    for i in range(len(trees)):
        assert trees.tree_to_str(i) == trees2.tree_to_str(i)

    # Reinserting the same trees should not add new nodes.
    trees2.add("deelt", "delen")
    trees2.add("gedeeld", "delen")
    assert len(trees) == len(trees2)


def test_from_to_disk():
    strings = StringStore()
    trees = EditTrees(strings)
    trees.add("deelt", "delen")
    trees.add("gedeeld", "delen")

    trees2 = EditTrees(strings)
    with make_tempdir() as temp_dir:
        trees_file = temp_dir / "edit_trees.bin"
        trees.to_disk(trees_file)
        trees2 = trees2.from_disk(trees_file)

    # Verify that the nodes did not change.
    assert len(trees) == len(trees2)
    for i in range(len(trees)):
        assert trees.tree_to_str(i) == trees2.tree_to_str(i)

    # Reinserting the same trees should not add new nodes.
    trees2.add("deelt", "delen")
    trees2.add("gedeeld", "delen")
    assert len(trees) == len(trees2)


@given(st.text(), st.text())
def test_roundtrip(form, lemma):
    strings = StringStore()
    trees = EditTrees(strings)
    tree = trees.add(form, lemma)
    assert trees.apply(tree, form) == lemma


@given(st.text(alphabet="ab"), st.text(alphabet="ab"))
def test_roundtrip_small_alphabet(form, lemma):
    # Test with small alphabets to have more overlap.
    strings = StringStore()
    trees = EditTrees(strings)
    tree = trees.add(form, lemma)
    assert trees.apply(tree, form) == lemma


def test_unapplicable_trees():
    strings = StringStore()
    trees = EditTrees(strings)
    tree3 = trees.add("deelt", "delen")

    # Replacement fails.
    assert trees.apply(tree3, "deeld") == None

    # Suffix + prefix are too large.
    assert trees.apply(tree3, "de") == None


def test_empty_strings():
    strings = StringStore()
    trees = EditTrees(strings)
    no_change = trees.add("xyz", "xyz")
    empty = trees.add("", "")
    assert no_change == empty
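
The edit-tree tests above rely on two properties: applying the tree built from a (form, lemma) pair to that form yields the lemma, and applying it to a form that does not fit yields `None`. The following is a minimal sketch of that idea using only the standard library. `build_tree` and `apply_tree` are hypothetical names introduced here for illustration; they are not spaCy's `EditTrees` API, and the nested-tuple layout is an assumption loosely modelled on the `(m prefix suffix left right)` / `(s orig subst)` strings asserted in `test_dutch`.

```python
from difflib import SequenceMatcher
from itertools import product


def build_tree(form, lemma):
    """Build a toy edit tree (nested tuples) that rewrites `form` into `lemma`.

    Hypothetical stand-in for illustration; not spaCy's EditTrees.
    """
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    match = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if match.size == 0:
        # No shared substring: substitute the whole span.
        return ("s", form, lemma)
    prefix_len = match.a
    suffix_len = len(form) - (match.a + match.size)
    # Recurse on the parts before and after the shared substring.
    left = build_tree(form[: match.a], lemma[: match.b])
    right = build_tree(form[match.a + match.size :], lemma[match.b + match.size :])
    return ("m", prefix_len, suffix_len, left, right)


def apply_tree(tree, form):
    """Apply a toy edit tree to `form`; return None if the tree does not fit."""
    if tree[0] == "s":
        _, original, replacement = tree
        return replacement if form == original else None
    _, prefix_len, suffix_len, left, right = tree
    if prefix_len + suffix_len > len(form):
        return None
    middle_end = len(form) - suffix_len
    new_prefix = apply_tree(left, form[:prefix_len])
    new_suffix = apply_tree(right, form[middle_end:])
    if new_prefix is None or new_suffix is None:
        return None
    # The middle part of the form is copied unchanged.
    return new_prefix + form[prefix_len:middle_end] + new_suffix


tree = build_tree("deelt", "delen")
assert apply_tree(tree, "deelt") == "delen"
assert apply_tree(tree, "deeld") is None  # substitution node does not match
assert apply_tree(tree, "de") is None  # prefix + suffix longer than the form

# Exhaustive round-trip check over short strings from a two-letter alphabet.
words = ["".join(chars) for n in range(4) for chars in product("ab", repeat=n)]
assert all(apply_tree(build_tree(f, l), f) == l for f in words for l in words)
```

The exhaustive loop plays the role that `hypothesis` plays in `test_roundtrip_small_alphabet`: a small alphabet produces many overlapping substrings, which exercises the recursive match nodes rather than only whole-string substitutions.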


@ -9,6 +9,9 @@ from spacy.compat import pickle
from spacy.kb import Candidate, KnowledgeBase, get_candidates from spacy.kb import Candidate, KnowledgeBase, get_candidates
from spacy.lang.en import English from spacy.lang.en import English
from spacy.ml import load_kb from spacy.ml import load_kb
from spacy.pipeline import EntityLinker
from spacy.pipeline.legacy import EntityLinker_v1
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
from spacy.scorer import Scorer from spacy.scorer import Scorer
from spacy.tests.util import make_tempdir from spacy.tests.util import make_tempdir
from spacy.tokens import Span from spacy.tokens import Span
@ -168,6 +171,45 @@ def test_issue7065_b():
assert doc assert doc
def test_no_entities():
    # Test that having no entities doesn't crash the model
    TRAIN_DATA = [
        (
            "The sky is blue.",
            {
                "sent_starts": [1, 0, 0, 0, 0],
            },
        )
    ]
    nlp = English()
    vector_length = 3
    train_examples = []
    for text, annotation in TRAIN_DATA:
        doc = nlp(text)
        train_examples.append(Example.from_dict(doc, annotation))

    def create_kb(vocab):
        # create artificial KB
        mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
        mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
        mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
        return mykb

    # Create and train the Entity Linker
    entity_linker = nlp.add_pipe("entity_linker", last=True)
    entity_linker.set_kb(create_kb)
    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(2):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)

    # adding additional components that are required for the entity_linker
    nlp.add_pipe("sentencizer", first=True)

    # this will run the pipeline on the examples and shouldn't crash
    results = nlp.evaluate(train_examples)


def test_partial_links(): def test_partial_links():
# Test that having some entities on the doc without gold links, doesn't crash # Test that having some entities on the doc without gold links, doesn't crash
TRAIN_DATA = [ TRAIN_DATA = [
@ -650,7 +692,7 @@ TRAIN_DATA = [
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}), "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}),
("Russ Cochran his reprints include EC Comics.", ("Russ Cochran his reprints include EC Comics.",
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}, {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
"entities": [(0, 12, "PERSON")], "entities": [(0, 12, "PERSON"), (34, 43, "ART")],
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}), "sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}),
("Russ Cochran has been publishing comic art.", ("Russ Cochran has been publishing comic art.",
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}}, {"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
@ -693,6 +735,7 @@ def test_overfitting_IO():
# Create the Entity Linker component and add it to the pipeline # Create the Entity Linker component and add it to the pipeline
entity_linker = nlp.add_pipe("entity_linker", last=True) entity_linker = nlp.add_pipe("entity_linker", last=True)
assert isinstance(entity_linker, EntityLinker)
entity_linker.set_kb(create_kb) entity_linker.set_kb(create_kb)
assert "Q2146908" in entity_linker.vocab.strings assert "Q2146908" in entity_linker.vocab.strings
assert "Q2146908" in entity_linker.kb.vocab.strings assert "Q2146908" in entity_linker.kb.vocab.strings
@ -922,3 +965,113 @@ def test_scorer_links():
assert scores["nel_micro_p"] == 2 / 3 assert scores["nel_micro_p"] == 2 / 3
assert scores["nel_micro_r"] == 2 / 4 assert scores["nel_micro_r"] == 2 / 4
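
The two assertions above check micro-averaged precision and recall for entity linking. As a worked illustration of the arithmetic only, here is a minimal sketch; `micro_prf` is a hypothetical helper, not spaCy's `Scorer`, and the counts (2 correct links, 1 spurious prediction, 2 missed gold links) are an assumption chosen to reproduce the asserted ratios `2 / 3` and `2 / 4`.

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged precision, recall and F-score from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Assumed counts, chosen only to reproduce the asserted ratios:
# 2 correct links, 1 spurious prediction, 2 missed gold links.
precision, recall, f_score = micro_prf(tp=2, fp=1, fn=2)
assert precision == 2 / 3
assert recall == 2 / 4
```

Precision divides by everything predicted (`tp + fp`), recall by everything in the gold annotation (`tp + fn`), which is why the same two correct links give different denominators.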


# fmt: off
@pytest.mark.parametrize(
    "name,config",
    [
        ("entity_linker", {"@architectures": "spacy.EntityLinker.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
        ("entity_linker", {"@architectures": "spacy.EntityLinker.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
    ],
)
# fmt: on
def test_legacy_architectures(name, config):
    # Ensure that the legacy architectures still work
    vector_length = 3
    nlp = English()

    train_examples = []
    for text, annotation in TRAIN_DATA:
        doc = nlp.make_doc(text)
        train_examples.append(Example.from_dict(doc, annotation))

    def create_kb(vocab):
        mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
        mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
        mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7])
        mykb.add_alias(
            alias="Russ Cochran",
            entities=["Q2146908", "Q7381115"],
            probabilities=[0.5, 0.5],
        )
        return mykb

    entity_linker = nlp.add_pipe(name, config={"model": config})
    if config["@architectures"] == "spacy.EntityLinker.v1":
        assert isinstance(entity_linker, EntityLinker_v1)
    else:
        assert isinstance(entity_linker, EntityLinker)
    entity_linker.set_kb(create_kb)
    optimizer = nlp.initialize(get_examples=lambda: train_examples)

    for i in range(2):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)


@pytest.mark.parametrize(
    "patterns",
    [
        # perfect case
        [{"label": "CHARACTER", "pattern": "Kirby"}],
        # typo for false negative
        [{"label": "PERSON", "pattern": "Korby"}],
        # random stuff for false positive
        [{"label": "IS", "pattern": "is"}, {"label": "COLOR", "pattern": "pink"}],
    ],
)
def test_no_gold_ents(patterns):
    # test that annotating components work
    TRAIN_DATA = [
        (
            "Kirby is pink",
            {
                "links": {(0, 5): {"Q613241": 1.0}},
                "entities": [(0, 5, "CHARACTER")],
                "sent_starts": [1, 0, 0],
            },
        )
    ]
    nlp = English()
    vector_length = 3
    train_examples = []
    for text, annotation in TRAIN_DATA:
        doc = nlp(text)
        train_examples.append(Example.from_dict(doc, annotation))

    # Create a ruler to mark entities
    ruler = nlp.add_pipe("entity_ruler")
    ruler.add_patterns(patterns)

    # Apply ruler to examples. In a real pipeline this would be an annotating component.
    for eg in train_examples:
        eg.predicted = ruler(eg.predicted)

    def create_kb(vocab):
        # create artificial KB
        mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
        mykb.add_entity(entity="Q613241", freq=12, entity_vector=[6, -4, 3])
        mykb.add_alias("Kirby", ["Q613241"], [0.9])
        # Placeholder
        mykb.add_entity(entity="pink", freq=12, entity_vector=[7, 2, -5])
        mykb.add_alias("pink", ["pink"], [0.9])
        return mykb

    # Create and train the Entity Linker
    entity_linker = nlp.add_pipe(
        "entity_linker", config={"use_gold_ents": False}, last=True
    )
    entity_linker.set_kb(create_kb)
    assert entity_linker.use_gold_ents == False

    optimizer = nlp.initialize(get_examples=lambda: train_examples)
    for i in range(2):
        losses = {}
        nlp.update(train_examples, sgd=optimizer, losses=losses)

    # adding additional components that are required for the entity_linker
    nlp.add_pipe("sentencizer", first=True)

    # this will run the pipeline on the examples and shouldn't crash
    results = nlp.evaluate(train_examples)


@ -184,7 +184,7 @@ def test_overfitting_IO():
token.pos_ = "" token.pos_ = ""
token.set_morph(None) token.set_morph(None)
optimizer = nlp.initialize(get_examples=lambda: train_examples) optimizer = nlp.initialize(get_examples=lambda: train_examples)
print(nlp.get_pipe("morphologizer").labels) assert nlp.get_pipe("morphologizer").labels is not None
for i in range(50): for i in range(50):
losses = {} losses = {}
nlp.update(train_examples, sgd=optimizer, losses=losses) nlp.update(train_examples, sgd=optimizer, losses=losses)

Some files were not shown because too many files have changed in this diff.