Mirror of https://github.com/explosion/spaCy.git (synced 2025-01-12 02:06:31 +03:00)
Merge pull request #10777 from adrianeboyd/chore/update-develop-v3.4
Update develop for v3.4
commit 6d17168c4d
2
.github/ISSUE_TEMPLATE/01_bugs.md
vendored
|
@ -4,6 +4,8 @@ about: Use this template if you came across a bug or unexpected behaviour differ
|
|||
|
||||
---
|
||||
|
||||
<!-- NOTE: For questions or install related issues, please open a Discussion instead. -->
|
||||
|
||||
## How to reproduce the behaviour
|
||||
<!-- Include a code example or the steps that led to the problem. Please try to be as specific as possible. -->
|
||||
|
||||
|
|
3
.github/ISSUE_TEMPLATE/config.yml
vendored
|
@ -1,8 +1,5 @@
|
|||
blank_issues_enabled: false
|
||||
contact_links:
|
||||
- - name: ⚠️ Python 3.10 Support
- url: https://github.com/explosion/spaCy/discussions/9418
- about: Python 3.10 wheels haven't been released yet, see the link for details.
|
||||
- name: 🗯 Discussions Forum
|
||||
url: https://github.com/explosion/spaCy/discussions
|
||||
about: Install issues, usage questions, general discussion and anything else that isn't a bug report.
|
||||
|
|
106
.github/contributors/fonfonx.md
vendored
Normal file
|
@ -0,0 +1,106 @@
|
|||
# spaCy contributor agreement
|
||||
|
||||
This spaCy Contributor Agreement (**"SCA"**) is based on the
|
||||
[Oracle Contributor Agreement](http://www.oracle.com/technetwork/oca-405177.pdf).
|
||||
The SCA applies to any contribution that you make to any product or project
|
||||
managed by us (the **"project"**), and sets out the intellectual property rights
|
||||
you grant to us in the contributed materials. The term **"us"** shall mean
|
||||
[ExplosionAI GmbH](https://explosion.ai/legal). The term
|
||||
**"you"** shall mean the person or entity identified below.
|
||||
|
||||
If you agree to be bound by these terms, fill in the information requested
|
||||
below and include the filled-in version with your first pull request, under the
|
||||
folder [`.github/contributors/`](/.github/contributors/). The name of the file
|
||||
should be your GitHub username, with the extension `.md`. For example, the user
|
||||
example_user would create the file `.github/contributors/example_user.md`.
|
||||
|
||||
Read this agreement carefully before signing. These terms and conditions
|
||||
constitute a binding legal agreement.
|
||||
|
||||
## Contributor Agreement
|
||||
|
||||
1. The term "contribution" or "contributed materials" means any source code,
|
||||
object code, patch, tool, sample, graphic, specification, manual,
|
||||
documentation, or any other material posted or submitted by you to the project.
|
||||
|
||||
2. With respect to any worldwide copyrights, or copyright applications and
|
||||
registrations, in your contribution:
|
||||
|
||||
* you hereby assign to us joint ownership, and to the extent that such
|
||||
assignment is or becomes invalid, ineffective or unenforceable, you hereby
|
||||
grant to us a perpetual, irrevocable, non-exclusive, worldwide, no-charge,
|
||||
royalty-free, unrestricted license to exercise all rights under those
|
||||
copyrights. This includes, at our option, the right to sublicense these same
|
||||
rights to third parties through multiple levels of sublicensees or other
|
||||
licensing arrangements;
|
||||
|
||||
* you agree that each of us can do all things in relation to your
|
||||
contribution as if each of us were the sole owners, and if one of us makes
|
||||
a derivative work of your contribution, the one who makes the derivative
|
||||
work (or has it made will be the sole owner of that derivative work;
|
||||
|
||||
* you agree that you will not assert any moral rights in your contribution
|
||||
against us, our licensees or transferees;
|
||||
|
||||
* you agree that we may register a copyright in your contribution and
|
||||
exercise all ownership rights associated with it; and
|
||||
|
||||
* you agree that neither of us has any duty to consult with, obtain the
|
||||
consent of, pay or render an accounting to the other for any use or
|
||||
distribution of your contribution.
|
||||
|
||||
3. With respect to any patents you own, or that you can license without payment
|
||||
to any third party, you hereby grant to us a perpetual, irrevocable,
|
||||
non-exclusive, worldwide, no-charge, royalty-free license to:
|
||||
|
||||
* make, have made, use, sell, offer to sell, import, and otherwise transfer
|
||||
your contribution in whole or in part, alone or in combination with or
|
||||
included in any product, work or materials arising out of the project to
|
||||
which your contribution was submitted, and
|
||||
|
||||
* at our option, to sublicense these same rights to third parties through
|
||||
multiple levels of sublicensees or other licensing arrangements.
|
||||
|
||||
4. Except as set out above, you keep all right, title, and interest in your
|
||||
contribution. The rights that you grant to us under these terms are effective
|
||||
on the date you first submitted a contribution to us, even if your submission
|
||||
took place before the date you sign these terms.
|
||||
|
||||
5. You covenant, represent, warrant and agree that:
|
||||
|
||||
* Each contribution that you submit is and shall be an original work of
|
||||
authorship and you can legally grant the rights set out in this SCA;
|
||||
|
||||
* to the best of your knowledge, each contribution will not violate any
|
||||
third party's copyrights, trademarks, patents, or other intellectual
|
||||
property rights; and
|
||||
|
||||
* each contribution shall be in compliance with U.S. export control laws and
|
||||
other applicable export and import laws. You agree to notify us if you
|
||||
become aware of any circumstance which would make any of the foregoing
|
||||
representations inaccurate in any respect. We may publicly disclose your
|
||||
participation in the project, including the fact that you have signed the SCA.
|
||||
|
||||
6. This SCA is governed by the laws of the State of California and applicable
|
||||
U.S. Federal law. Any choice of law rules will not apply.
|
||||
|
||||
7. Please place an “x” on one of the applicable statement below. Please do NOT
|
||||
mark both statements:
|
||||
|
||||
* [x] I am signing on behalf of myself as an individual and no other person
|
||||
or entity, including my employer, has or will have rights with respect to my
|
||||
contributions.
|
||||
|
||||
* [ ] I am signing on behalf of my employer or a legal entity and I have the
|
||||
actual authority to contractually bind that entity.
|
||||
|
||||
## Contributor Details
|
||||
|
||||
| Field | Entry |
|
||||
|------------------------------- | -------------------- |
|
||||
| Name | Xavier Fontaine |
|
||||
| Company name (if applicable) | |
|
||||
| Title or role (if applicable) | |
|
||||
| Date | 2022-04-13 |
|
||||
| GitHub username | fonfonx |
|
||||
| Website (optional) | |
|
21
.github/workflows/gputests.yml
vendored
Normal file
|
@ -0,0 +1,21 @@
|
|||
name: Weekly GPU tests
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 1 * * MON'
|
||||
|
||||
jobs:
|
||||
weekly-gputests:
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
branch: [master, v4]
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Trigger buildkite build
|
||||
uses: buildkite/trigger-pipeline-action@v1.2.0
|
||||
env:
|
||||
PIPELINE: explosion-ai/spacy-slow-gpu-tests
|
||||
BRANCH: ${{ matrix.branch }}
|
||||
MESSAGE: ":github: Weekly GPU + slow tests - triggered from a GitHub Action"
|
||||
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
|
37
.github/workflows/slowtests.yml
vendored
Normal file
|
@ -0,0 +1,37 @@
|
|||
name: Daily slow tests
|
||||
|
||||
on:
|
||||
schedule:
|
||||
- cron: '0 0 * * *'
|
||||
|
||||
jobs:
|
||||
daily-slowtests:
|
||||
strategy:
|
||||
fail-fast: false
|
||||
matrix:
|
||||
branch: [master, v4]
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- name: Checkout
|
||||
uses: actions/checkout@v1
|
||||
with:
|
||||
ref: ${{ matrix.branch }}
|
||||
- name: Get commits from past 24 hours
|
||||
id: check_commits
|
||||
run: |
|
||||
today=$(date '+%Y-%m-%d %H:%M:%S')
|
||||
yesterday=$(date -d "yesterday" '+%Y-%m-%d %H:%M:%S')
|
||||
if git log --after="$yesterday" --before="$today" | grep commit ; then
|
||||
echo "::set-output name=run_tests::true"
|
||||
else
|
||||
echo "::set-output name=run_tests::false"
|
||||
fi
|
||||
|
||||
- name: Trigger buildkite build
|
||||
if: steps.check_commits.outputs.run_tests == 'true'
|
||||
uses: buildkite/trigger-pipeline-action@v1.2.0
|
||||
env:
|
||||
PIPELINE: explosion-ai/spacy-slow-tests
|
||||
BRANCH: ${{ matrix.branch }}
|
||||
MESSAGE: ":github: Daily slow tests - triggered from a GitHub Action"
|
||||
BUILDKITE_API_ACCESS_TOKEN: ${{ secrets.BUILDKITE_SECRET }}
|
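For reference, the gating logic in the `Get commits from past 24 hours` step can be sketched in Python. This is a minimal illustration, assuming commit timestamps are already available as `datetime` objects; the function name `should_run_tests` and the exclusive window bounds are assumptions for the example and do not appear in the workflow.

```python
from datetime import datetime, timedelta
from typing import Iterable


def should_run_tests(commit_times: Iterable[datetime], now: datetime) -> bool:
    """Return True if any commit falls within the 24 hours before `now`."""
    yesterday = now - timedelta(days=1)
    # Mirrors `git log --after="$yesterday" --before="$today"` on a list of timestamps
    return any(yesterday < t < now for t in commit_times)


now = datetime(2022, 5, 1, 0, 0, 0)
recent = [datetime(2022, 4, 30, 12, 0, 0)]
stale = [datetime(2022, 4, 20, 9, 30, 0)]
print(should_run_tests(recent, now))  # True
print(should_run_tests(stale, now))   # False
```

In the workflow itself, the result is passed to the next step through the `run_tests` output, so the Buildkite trigger is skipped on days without commits.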
1
.gitignore
vendored
|
@ -9,7 +9,6 @@ keys/
|
|||
spacy/tests/package/setup.cfg
|
||||
spacy/tests/package/pyproject.toml
|
||||
spacy/tests/package/requirements.txt
|
||||
- spacy/tests/universe/universe.json
|
||||
|
||||
# Website
|
||||
website/.cache/
|
||||
|
|
|
@ -1,9 +1,10 @@
|
|||
repos:
|
||||
- repo: https://github.com/ambv/black
|
||||
- rev: 21.6b0
+ rev: 22.3.0
|
||||
hooks:
|
||||
- id: black
|
||||
language_version: python3.7
|
||||
additional_dependencies: ['click==8.0.4']
|
||||
- repo: https://gitlab.com/pycqa/flake8
|
||||
rev: 3.9.2
|
||||
hooks:
|
||||
|
|
|
@ -233,7 +233,7 @@ also want to keep an eye on unused declared variables or repeated
|
|||
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
|
||||
(see above), you shouldn't see any formatting-related warnings.
|
||||
|
||||
- The [`.flake8`](.flake8) config defines the configuration we use for this
+ The `flake8` section in [`setup.cfg`](setup.cfg) defines the configuration we use for this
|
||||
codebase. For example, we're not super strict about the line length, and we're
|
||||
excluding very large files like lemmatization and tokenizer exception tables.
|
||||
|
||||
|
|
|
@ -33,7 +33,7 @@ open-source software, released under the MIT license.
|
|||
## 📖 Documentation
|
||||
|
||||
| Documentation | |
|
||||
| -------------------------- | -------------------------------------------------------------- |
|
||||
| ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
|
||||
| ⭐️ **[spaCy 101]** | New to spaCy? Here's everything you need to know! |
|
||||
| 📚 **[Usage Guides]** | How to use spaCy and its features. |
|
||||
| 🚀 **[New in v3.0]** | New features, backwards incompatibilities and migration guide. |
|
||||
|
@ -45,6 +45,7 @@ open-source software, released under the MIT license.
|
|||
| 📺 **[Videos]** | Our YouTube channel with video tutorials, talks and more. |
|
||||
| 🛠 **[Changelog]** | Changes and version history. |
|
||||
| 💝 **[Contribute]** | How to contribute to the spaCy project and code base. |
|
||||
| <a href="https://explosion.ai/spacy-tailored-pipelines"><img src="https://user-images.githubusercontent.com/13643239/152853098-1c761611-ccb0-4ec6-9066-b234552831fe.png" width="125" alt="spaCy Tailored Pipelines"/></a> | Get a custom spaCy pipeline, tailor-made for your NLP problem by spaCy's core developers. Streamlined, production-ready, predictable and maintainable. Start by completing our 5-minute questionnaire to tell us what you need and we'll be in touch! **[Learn more →](https://explosion.ai/spacy-tailored-pipelines)** |
|
||||
|
||||
[spacy 101]: https://spacy.io/usage/spacy-101
|
||||
[new in v3.0]: https://spacy.io/usage/v3
|
||||
|
@ -60,9 +61,7 @@ open-source software, released under the MIT license.
|
|||
|
||||
## 💬 Where to ask questions
|
||||
|
||||
- The spaCy project is maintained by **[@honnibal](https://github.com/honnibal)**,
- **[@ines](https://github.com/ines)**, **[@svlandeg](https://github.com/svlandeg)**,
- **[@adrianeboyd](https://github.com/adrianeboyd)** and **[@polm](https://github.com/polm)**.
+ The spaCy project is maintained by the [spaCy team](https://explosion.ai/about).
|
||||
Please understand that we won't be able to provide individual support via email.
|
||||
We also believe that help is much more valuable if it's shared publicly, so that
|
||||
more people can benefit from it.
|
||||
|
|
|
@ -11,12 +11,14 @@ trigger:
|
|||
exclude:
|
||||
- "website/*"
|
||||
- "*.md"
|
||||
- ".github/workflows/*"
|
||||
pr:
|
||||
paths:
|
||||
exclude:
|
||||
- "*.md"
|
||||
- "website/docs/*"
|
||||
- "website/src/*"
|
||||
- ".github/workflows/*"
|
||||
|
||||
jobs:
|
||||
# Perform basic checks for most important errors (syntax etc.) Uses the config
|
||||
|
|
|
@ -137,7 +137,7 @@ If any of the TODOs you've added are important and should be fixed soon, you sho
|
|||
|
||||
## Type hints
|
||||
|
||||
- We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation.
+ We use Python type hints across the `.py` files wherever possible. This makes it easy to understand what a function expects and returns, and modern editors will be able to show this information to you when you call an annotated function. Type hints are not currently used in the `.pyx` (Cython) code, except for definitions of registered functions and component factories, where they're used for config validation. Ideally when developing, run `mypy spacy` on the code base to inspect any issues.
|
||||
|
||||
If possible, you should always use the more descriptive type hints like `List[str]` or even `List[Any]` instead of only `list`. We also annotate arguments and return types of `Callable` – although, you can simplify this if the type otherwise gets too verbose (e.g. functions that return factories to create callbacks). Remember that `Callable` takes two values: a **list** of the argument type(s) in order, and the return values.
|
||||
|
||||
|
@ -155,6 +155,13 @@ def create_callback(some_arg: bool) -> Callable[[str, int], List[str]]:
|
|||
return callback
|
||||
```
|
||||
|
||||
For typing variables, we prefer the explicit format.
|
||||
|
||||
```diff
|
||||
- var = value # type: Type
|
||||
+ var: Type = value
|
||||
```
|
||||
|
||||
For model architectures, Thinc also provides a collection of [custom types](https://thinc.ai/docs/api-types), including more specific types for arrays and model inputs/outputs. Even outside of static type checking, using these types will make the code a lot easier to read and follow, since it's always clear what array types are expected (and what might go wrong if the output is different from the expected type).
|
||||
|
||||
```python
|
||||
|
|
36
extra/DEVELOPER_DOCS/ExplosionBot.md
Normal file
|
@ -0,0 +1,36 @@
|
|||
# Explosion-bot
|
||||
|
||||
Explosion-bot is a robot that can be invoked to help with running particular test commands.
|
||||
|
||||
## Permissions
|
||||
|
||||
Only maintainers have permissions to summon explosion-bot. Each of the open source repos that use explosion-bot has its own team(s) of maintainers, and only github users who are members of those teams can successfully run bot commands.
|
||||
|
||||
## Running robot commands
|
||||
|
||||
To summon the robot, write a github comment on the issue/PR you wish to test. The comment must be in the following format:
|
||||
|
||||
```
|
||||
@explosion-bot please test_gpu
|
||||
```
|
||||
|
||||
Some things to note:
|
||||
|
||||
* The `@explosion-bot please` must be the beginning of the command - you cannot add anything in front of this or else the robot won't know how to parse it. Adding anything at the end aside from the test name will also confuse the robot, so keep it simple!
|
||||
* The command name (such as `test_gpu`) must be one of the tests that the bot knows how to run. The available commands are documented in the bot's [workflow config](https://github.com/explosion/spaCy/blob/master/.github/workflows/explosionbot.yml#L26) and must match exactly one of the commands listed there.
|
||||
* The robot can't do multiple things at once, so if you want it to run multiple tests, you'll have to summon it with one comment per test.
|
||||
* For the `test_gpu` command, you can specify an optional thinc branch (from the spaCy repo) or a spaCy branch (from the thinc repo) with either the `--thinc-branch` or `--spacy-branch` flags. By default, the bot will pull in the PR branch from the repo where the command was issued, and the main branch of the other repository. However, if you need to run against another branch, you can say (for example):
|
||||
|
||||
```
|
||||
@explosion-bot please test_gpu --thinc-branch develop
|
||||
```
|
||||
You can also specify a branch from an unmerged PR:
|
||||
```
|
||||
@explosion-bot please test_gpu --thinc-branch refs/pull/633/head
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
If the robot isn't responding to commands as expected, you can check its logs in the [Github Action](https://github.com/explosion/spaCy/actions/workflows/explosionbot.yml).
|
||||
|
||||
For each command sent to the bot, there should be a run of the `explosion-bot` workflow. In the `Install and run explosion-bot` step, towards the ends of the logs you should see info about the configuration that the bot was run with, as well as any errors that the bot encountered.
|
|
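The comment format described in `ExplosionBot.md` (fixed `@explosion-bot please` prefix, one command name, at most one optional branch flag) can be illustrated with a small parser. This is a hypothetical sketch only: the bot's real parser is not part of this diff, and the names `COMMAND_RE` and `parse_bot_comment` are invented for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern derived from the documented format, not the bot's own code
COMMAND_RE = re.compile(
    r"^@explosion-bot please (?P<command>\w+)"
    r"(?: --(?P<flag>thinc-branch|spacy-branch) (?P<branch>\S+))?$"
)


def parse_bot_comment(comment: str) -> Optional[Dict[str, Optional[str]]]:
    """Parse a comment of the documented form; return None if it does not match."""
    # Only trailing whitespace is tolerated: nothing may precede the prefix
    match = COMMAND_RE.match(comment.rstrip())
    if match is None:
        return None
    return match.groupdict()


print(parse_bot_comment("@explosion-bot please test_gpu --thinc-branch develop"))
print(parse_bot_comment("hey @explosion-bot please test_gpu"))  # None
```

Anything in front of the prefix or after the command (other than a supported flag) makes the sketch return `None`, matching the "keep it simple" note in the document.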
@ -5,7 +5,7 @@ requires = [
|
|||
"cymem>=2.0.2,<2.1.0",
|
||||
"preshed>=3.0.2,<3.1.0",
|
||||
"murmurhash>=0.28.0,<1.1.0",
- "thinc>=8.0.12,<8.1.0",
+ "thinc>=8.0.14,<8.1.0",
|
||||
"blis>=0.4.0,<0.8.0",
|
||||
"pathy",
|
||||
"numpy>=1.15.0",
|
||||
|
|
|
@ -1,14 +1,14 @@
|
|||
# Our libraries
|
||||
- spacy-legacy>=3.0.8,<3.1.0
+ spacy-legacy>=3.0.9,<3.1.0
|
||||
spacy-loggers>=1.0.0,<2.0.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
- thinc>=8.0.12,<8.1.0
+ thinc>=8.0.14,<8.1.0
|
||||
blis>=0.4.0,<0.8.0
|
||||
ml_datasets>=0.2.0,<0.3.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
- wasabi>=0.8.1,<1.1.0
- srsly>=2.4.1,<3.0.0
+ wasabi>=0.9.1,<1.1.0
+ srsly>=2.4.3,<3.0.0
|
||||
catalogue>=2.0.6,<2.1.0
|
||||
typer>=0.3.0,<0.5.0
|
||||
pathy>=0.3.5
|
||||
|
@ -26,7 +26,7 @@ typing_extensions>=3.7.4.1,<4.0.0.0; python_version < "3.8"
|
|||
# Development dependencies
|
||||
pre-commit>=2.13.0
|
||||
cython>=0.25,<3.0
|
||||
- pytest>=5.2.0
+ pytest>=5.2.0,!=7.1.0
|
||||
pytest-timeout>=1.3.0,<2.0.0
|
||||
mock>=2.0.0,<3.0.0
|
||||
flake8>=3.8.0,<3.10.0
|
||||
|
@ -35,3 +35,4 @@ mypy==0.910
|
|||
types-dataclasses>=0.1.3; python_version < "3.7"
|
||||
types-mock>=0.1.1
|
||||
types-requests
|
||||
black>=22.0,<23.0
|
||||
|
|
10
setup.cfg
|
@ -38,18 +38,18 @@ setup_requires =
|
|||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
- thinc>=8.0.12,<8.1.0
+ thinc>=8.0.14,<8.1.0
|
||||
install_requires =
|
||||
# Our libraries
|
||||
- spacy-legacy>=3.0.8,<3.1.0
+ spacy-legacy>=3.0.9,<3.1.0
|
||||
spacy-loggers>=1.0.0,<2.0.0
|
||||
murmurhash>=0.28.0,<1.1.0
|
||||
cymem>=2.0.2,<2.1.0
|
||||
preshed>=3.0.2,<3.1.0
|
||||
- thinc>=8.0.12,<8.1.0
+ thinc>=8.0.14,<8.1.0
|
||||
blis>=0.4.0,<0.8.0
|
||||
- wasabi>=0.8.1,<1.1.0
- srsly>=2.4.1,<3.0.0
+ wasabi>=0.9.1,<1.1.0
+ srsly>=2.4.3,<3.0.0
|
||||
catalogue>=2.0.6,<2.1.0
|
||||
typer>=0.3.0,<0.5.0
|
||||
pathy>=0.3.5
|
||||
|
|
3
setup.py
|
@ -23,6 +23,7 @@ Options.docstrings = True
|
|||
|
||||
PACKAGES = find_packages()
|
||||
MOD_NAMES = [
|
||||
"spacy.training.alignment_array",
|
||||
"spacy.training.example",
|
||||
"spacy.parts_of_speech",
|
||||
"spacy.strings",
|
||||
|
@ -33,6 +34,7 @@ MOD_NAMES = [
|
|||
"spacy.ml.parser_model",
|
||||
"spacy.morphology",
|
||||
"spacy.pipeline.dep_parser",
|
||||
"spacy.pipeline._edit_tree_internals.edit_trees",
|
||||
"spacy.pipeline.morphologizer",
|
||||
"spacy.pipeline.multitask",
|
||||
"spacy.pipeline.ner",
|
||||
|
@ -81,7 +83,6 @@ COPY_FILES = {
|
|||
ROOT / "setup.cfg": PACKAGE_ROOT / "tests" / "package",
|
||||
ROOT / "pyproject.toml": PACKAGE_ROOT / "tests" / "package",
|
||||
ROOT / "requirements.txt": PACKAGE_ROOT / "tests" / "package",
|
||||
- ROOT / "website" / "meta" / "universe.json": PACKAGE_ROOT / "tests" / "universe",
|
||||
}
|
||||
|
||||
|
||||
|
|
|
@ -1,6 +1,6 @@
|
|||
# fmt: off
|
||||
__title__ = "spacy"
|
||||
- __version__ = "3.2.1"
+ __version__ = "3.3.0"
|
||||
__download_url__ = "https://github.com/explosion/spacy-models/releases/download"
|
||||
__compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json"
|
||||
__projects__ = "https://github.com/explosion/projects"
|
||||
|
|
|
@ -14,6 +14,7 @@ from .pretrain import pretrain # noqa: F401
|
|||
from .debug_data import debug_data # noqa: F401
|
||||
from .debug_config import debug_config # noqa: F401
|
||||
from .debug_model import debug_model # noqa: F401
|
||||
from .debug_diff import debug_diff # noqa: F401
|
||||
from .evaluate import evaluate # noqa: F401
|
||||
from .convert import convert # noqa: F401
|
||||
from .init_pipeline import init_pipeline_cli # noqa: F401
|
||||
|
|
|
@ -360,7 +360,7 @@ def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False)
|
|||
src = str(src)
|
||||
with smart_open.open(src, mode="rb", ignore_ext=True) as input_file:
|
||||
with dest.open(mode="wb") as output_file:
|
||||
- output_file.write(input_file.read())
+ shutil.copyfileobj(input_file, output_file)
|
||||
|
||||
|
||||
def ensure_pathy(path):
|
||||
|
|
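The `download_file` hunk above replaces `output_file.write(input_file.read())` with `shutil.copyfileobj(input_file, output_file)`, which copies in chunks instead of reading the whole source into memory first. A minimal, self-contained sketch of that pattern follows, using in-memory `io.BytesIO` objects as stand-ins for the remote and local files. The helper name `copy_stream` and the payload size are assumptions for illustration.

```python
import io
import shutil
from typing import BinaryIO


def copy_stream(input_file: BinaryIO, output_file: BinaryIO) -> None:
    """Copy one binary stream to another in fixed-size chunks."""
    # copyfileobj reads and writes block by block until the input is exhausted
    shutil.copyfileobj(input_file, output_file)


payload = b"x" * 1_000_000
src = io.BytesIO(payload)   # stand-in for the opened remote file
dest = io.BytesIO()         # stand-in for the opened destination file
copy_stream(src, dest)
print(len(dest.getvalue()))  # 1000000
```

With real file handles, the benefit is that peak memory stays near the chunk size rather than the size of the downloaded asset.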
|
@ -19,6 +19,7 @@ from ..morphology import Morphology
|
|||
from ..language import Language
|
||||
from ..util import registry, resolve_dot_names
|
||||
from ..compat import Literal
|
||||
from ..vectors import Mode as VectorsMode
|
||||
from .. import util
|
||||
|
||||
|
||||
|
@ -170,6 +171,14 @@ def debug_data(
|
|||
show=verbose,
|
||||
)
|
||||
if len(nlp.vocab.vectors):
|
||||
if nlp.vocab.vectors.mode == VectorsMode.floret:
|
||||
msg.info(
|
||||
f"floret vectors with {len(nlp.vocab.vectors)} vectors, "
|
||||
f"{nlp.vocab.vectors_length} dimensions, "
|
||||
f"{nlp.vocab.vectors.minn}-{nlp.vocab.vectors.maxn} char "
|
||||
f"n-gram subwords"
|
||||
)
|
||||
else:
|
||||
msg.info(
|
||||
f"{len(nlp.vocab.vectors)} vectors ({nlp.vocab.vectors.n_keys} "
|
||||
f"unique keys, {nlp.vocab.vectors_length} dimensions)"
|
||||
|
@ -193,6 +202,70 @@ def debug_data(
|
|||
else:
|
||||
msg.info("No word vectors present in the package")
|
||||
|
||||
if "spancat" in factory_names:
|
||||
model_labels_spancat = _get_labels_from_spancat(nlp)
|
||||
has_low_data_warning = False
|
||||
has_no_neg_warning = False
|
||||
|
||||
msg.divider("Span Categorization")
|
||||
msg.table(model_labels_spancat, header=["Spans Key", "Labels"], divider=True)
|
||||
|
||||
msg.text("Label counts in train data: ", show=verbose)
|
||||
for spans_key, data_labels in gold_train_data["spancat"].items():
|
||||
msg.text(
|
||||
f"Key: {spans_key}, {_format_labels(data_labels.items(), counts=True)}",
|
||||
show=verbose,
|
||||
)
|
||||
# Data checks: only take the spans keys in the actual spancat components
|
||||
data_labels_in_component = {
|
||||
spans_key: gold_train_data["spancat"][spans_key]
|
||||
for spans_key in model_labels_spancat.keys()
|
||||
}
|
||||
for spans_key, data_labels in data_labels_in_component.items():
|
||||
for label, count in data_labels.items():
|
||||
# Check for missing labels
|
||||
spans_key_in_model = spans_key in model_labels_spancat.keys()
|
||||
if (spans_key_in_model) and (
|
||||
label not in model_labels_spancat[spans_key]
|
||||
):
|
||||
msg.warn(
|
||||
f"Label '{label}' is not present in the model labels of key '{spans_key}'. "
|
||||
"Performance may degrade after training."
|
||||
)
|
||||
# Check for low number of examples per label
|
||||
if count <= NEW_LABEL_THRESHOLD:
|
||||
msg.warn(
|
||||
f"Low number of examples for label '{label}' in key '{spans_key}' ({count})"
|
||||
)
|
||||
has_low_data_warning = True
|
||||
# Check for negative examples
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
neg_docs = _get_examples_without_label(
|
||||
train_dataset, label, "spancat", spans_key
|
||||
)
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
||||
if has_low_data_warning:
|
||||
msg.text(
|
||||
f"To train a new span type, your data should include at "
|
||||
f"least {NEW_LABEL_THRESHOLD} instances of the new label",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Good amount of examples for all labels")
|
||||
|
||||
if has_no_neg_warning:
|
||||
msg.text(
|
||||
"Training data should always include examples of spans "
|
||||
"in context, as well as examples without a given span "
|
||||
"type.",
|
||||
show=verbose,
|
||||
)
|
||||
else:
|
||||
msg.good("Examples without ocurrences available for all labels")
|
||||
|
||||
if "ner" in factory_names:
|
||||
# Get all unique NER labels present in the data
|
||||
labels = set(
|
||||
|
@ -238,7 +311,7 @@ def debug_data(
|
|||
has_low_data_warning = True
|
||||
|
||||
with msg.loading("Analyzing label distribution..."):
|
||||
- neg_docs = _get_examples_without_label(train_dataset, label)
+ neg_docs = _get_examples_without_label(train_dataset, label, "ner")
|
||||
if neg_docs == 0:
|
||||
msg.warn(f"No examples for texts WITHOUT new label '{label}'")
|
||||
has_no_neg_warning = True
|
||||
|
@ -573,6 +646,7 @@ def _compile_gold(
|
|||
"deps": Counter(),
|
||||
"words": Counter(),
|
||||
"roots": Counter(),
|
||||
"spancat": dict(),
|
||||
"ws_ents": 0,
|
||||
"boundary_cross_ents": 0,
|
||||
"n_words": 0,
|
||||
|
@ -603,6 +677,7 @@ def _compile_gold(
|
|||
if nlp.vocab.strings[word] not in nlp.vocab.vectors:
|
||||
data["words_missing_vectors"].update([word])
|
||||
if "ner" in factory_names:
|
||||
sent_starts = eg.get_aligned_sent_starts()
|
||||
for i, label in enumerate(eg.get_aligned_ner()):
|
||||
if label is None:
|
||||
continue
|
||||
|
@ -612,10 +687,19 @@ def _compile_gold(
|
|||
if label.startswith(("B-", "U-")):
|
||||
combined_label = label.split("-")[1]
|
||||
data["ner"][combined_label] += 1
|
||||
- if gold[i].is_sent_start and label.startswith(("I-", "L-")):
+ if sent_starts[i] == True and label.startswith(("I-", "L-")):
|
||||
data["boundary_cross_ents"] += 1
|
||||
elif label == "-":
|
||||
data["ner"]["-"] += 1
|
||||
if "spancat" in factory_names:
|
||||
for span_key in list(eg.reference.spans.keys()):
|
||||
if span_key not in data["spancat"]:
|
||||
data["spancat"][span_key] = Counter()
|
||||
for i, span in enumerate(eg.reference.spans[span_key]):
|
||||
if span.label_ is None:
|
||||
continue
|
||||
else:
|
||||
data["spancat"][span_key][span.label_] += 1
|
||||
if "textcat" in factory_names or "textcat_multilabel" in factory_names:
|
||||
data["cats"].update(gold.cats)
|
||||
if any(val not in (0, 1) for val in gold.cats.values()):
|
||||
|
@ -686,14 +770,28 @@ def _format_labels(
|
|||
return ", ".join([f"'{l}'" for l in cast(Iterable[str], labels)])
|
||||
|
||||
|
||||
- def _get_examples_without_label(data: Sequence[Example], label: str) -> int:
+ def _get_examples_without_label(
+ data: Sequence[Example],
+ label: str,
+ component: Literal["ner", "spancat"] = "ner",
+ spans_key: Optional[str] = "sc",
+ ) -> int:
|
||||
count = 0
|
||||
for eg in data:
|
||||
if component == "ner":
|
||||
labels = [
|
||||
label.split("-")[1]
|
||||
for label in eg.get_aligned_ner()
|
||||
if label not in ("O", "-", None)
|
||||
]
|
||||
|
||||
if component == "spancat":
|
||||
labels = (
|
||||
[span.label_ for span in eg.reference.spans[spans_key]]
|
||||
if spans_key in eg.reference.spans
|
||||
else []
|
||||
)
|
||||
|
||||
if label not in labels:
|
||||
count += 1
|
||||
return count
|
||||
|
|
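The negative-example check in `_get_examples_without_label` can be illustrated without spaCy objects. The sketch below covers only the `"ner"` branch and assumes each example is already reduced to a plain list of BILUO tags. The function name `count_examples_without_label` and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence


def count_examples_without_label(
    examples: Sequence[List[Optional[str]]], label: str
) -> int:
    """Count examples whose BILUO tags never mention `label`."""
    count = 0
    for tags in examples:
        # Drop the "B-"/"I-"/"L-"/"U-" prefix; skip "O", "-" and missing tags
        labels = [tag.split("-")[1] for tag in tags if tag not in ("O", "-", None)]
        if label not in labels:
            count += 1
    return count


examples = [
    ["B-ORG", "L-ORG", "O"],   # contains ORG
    ["O", "U-PERSON", None],   # no ORG
    ["O", "O", "-"],           # no entities at all
]
print(count_examples_without_label(examples, "ORG"))  # 2
```

A result of `0` is what triggers the "No examples for texts WITHOUT new label" warning in `debug_data`.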
89
spacy/cli/debug_diff.py
Normal file
|
@ -0,0 +1,89 @@
|
|||
from typing import Optional
|
||||
|
||||
import typer
|
||||
from wasabi import Printer, diff_strings, MarkdownRenderer
|
||||
from pathlib import Path
|
||||
from thinc.api import Config
|
||||
|
||||
from ._util import debug_cli, Arg, Opt, show_validation_error, parse_config_overrides
|
||||
from ..util import load_config
|
||||
from .init_config import init_config, Optimizations
|
||||
|
||||
|
||||
@debug_cli.command(
|
||||
"diff-config",
|
||||
context_settings={"allow_extra_args": True, "ignore_unknown_options": True},
|
||||
)
|
||||
def debug_diff_cli(
|
||||
# fmt: off
|
||||
ctx: typer.Context,
|
||||
config_path: Path = Arg(..., help="Path to config file", exists=True, allow_dash=True),
|
||||
compare_to: Optional[Path] = Opt(None, help="Path to a config file to diff against, or `None` to compare against default settings", exists=True, allow_dash=True),
|
||||
optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether the user config was optimized for efficiency or accuracy. Only relevant when comparing against the default config."),
|
||||
gpu: bool = Opt(False, "--gpu", "-G", help="Whether the original config can run on a GPU. Only relevant when comparing against the default config."),
|
||||
pretraining: bool = Opt(False, "--pretraining", "--pt", help="Whether to compare on a config with pretraining involved. Only relevant when comparing against the default config."),
|
||||
markdown: bool = Opt(False, "--markdown", "-md", help="Generate Markdown for GitHub issues")
|
||||
# fmt: on
|
||||
):
|
||||
"""Show a diff of a config file with respect to spaCy's defaults or another config file. If
|
||||
additional settings were used in the creation of the config file, then you
|
||||
must supply these as extra parameters to the command when comparing to the default settings. The generated diff
|
||||
can also be used when posting to the discussion forum to provide more
|
||||
information for the maintainers.
|
||||
|
||||
The `optimize`, `gpu`, and `pretraining` options are only relevant when
|
||||
comparing against the default configuration (or specifically when `compare_to` is None).
|
||||
|
||||
DOCS: https://spacy.io/api/cli#debug-diff
|
||||
"""
|
||||
debug_diff(
|
||||
config_path=config_path,
|
||||
compare_to=compare_to,
|
||||
gpu=gpu,
|
||||
optimize=optimize,
|
||||
pretraining=pretraining,
|
||||
markdown=markdown,
|
||||
)
|
||||
|
||||
|
||||
def debug_diff(
|
||||
config_path: Path,
|
||||
compare_to: Optional[Path],
|
||||
gpu: bool,
|
||||
optimize: Optimizations,
|
||||
pretraining: bool,
|
||||
markdown: bool,
|
||||
):
|
||||
msg = Printer()
|
||||
with show_validation_error(hint_fill=False):
|
||||
user_config = load_config(config_path)
|
||||
if compare_to:
|
||||
other_config = load_config(compare_to)
|
||||
else:
|
||||
# Recreate a default config based from user's config
|
||||
lang = user_config["nlp"]["lang"]
|
||||
pipeline = list(user_config["nlp"]["pipeline"])
|
||||
msg.info(f"Found user-defined language: '{lang}'")
|
||||
msg.info(f"Found user-defined pipelines: {pipeline}")
|
||||
other_config = init_config(
|
||||
lang=lang,
|
||||
pipeline=pipeline,
|
||||
optimize=optimize.value,
|
||||
gpu=gpu,
|
||||
pretraining=pretraining,
|
||||
silent=True,
|
||||
)
|
||||
|
||||
user = user_config.to_str()
|
||||
other = other_config.to_str()
|
||||
|
||||
if user == other:
|
||||
msg.warn("No diff to show: configs are identical")
|
||||
else:
|
||||
diff_text = diff_strings(other, user, add_symbols=markdown)
|
||||
if markdown:
|
||||
md = MarkdownRenderer()
|
||||
md.add(md.code_block(diff_text, "diff"))
|
||||
print(md.text)
|
||||
else:
|
||||
print(diff_text)
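A minimal sketch of the comparison step above, assuming both configs have already been serialized to strings. The command in the diff uses `diff_strings` and `MarkdownRenderer`; this sketch swaps in the standard library's `difflib` so it runs without extra dependencies. The function name `diff_config_text` and the sample config strings are hypothetical.

```python
import difflib


def diff_config_text(other: str, user: str) -> str:
    """Return a unified diff from `other` to `user`, or "" if they are identical."""
    if user == other:
        # Mirrors the "No diff to show" branch: nothing to print.
        return ""
    lines = difflib.unified_diff(
        other.splitlines(),
        user.splitlines(),
        fromfile="other",
        tofile="user",
        lineterm="",
    )
    return "\n".join(lines)


default_cfg = "[nlp]\nbatch_size = 1000"
user_cfg = "[nlp]\nbatch_size = 128"
print(diff_config_text(default_cfg, user_cfg))
```

As in the command itself, the reference config comes first and the user config second, so lines the user changed show up as additions.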
|
|
@ -7,6 +7,7 @@ from collections import defaultdict
|
|||
from catalogue import RegistryError
|
||||
import srsly
|
||||
import sys
|
||||
import re
|
||||
|
||||
from ._util import app, Arg, Opt, string_to_list, WHEEL_SUFFIX, SDIST_SUFFIX
|
||||
from ..schemas import validate, ModelMetaSchema
|
||||
|
@ -109,6 +110,24 @@ def package(
|
|||
", ".join(meta["requirements"]),
|
||||
)
|
||||
if name is not None:
|
||||
if not name.isidentifier():
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a valid module name. "
|
||||
"This is required so it can be imported as a module.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://docs.python.org/3/reference/lexical_analysis.html#identifiers",
|
||||
exits=1,
|
||||
)
|
||||
if not _is_permitted_package_name(name):
|
||||
msg.fail(
|
||||
f"Model name ('{name}') is not a permitted package name. "
|
||||
"This is required to correctly load the model with spacy.load.",
|
||||
"We recommend names that use ASCII A-Z, a-z, _ (underscore), "
|
||||
"and 0-9. "
|
||||
"For specific details see: https://www.python.org/dev/peps/pep-0426/#name",
|
||||
exits=1,
|
||||
)
|
||||
meta["name"] = name
|
||||
if version is not None:
|
||||
meta["version"] = version
|
||||
|
@ -162,7 +181,7 @@ def package(
|
|||
imports="\n".join(f"from . import {m}" for m in imports)
|
||||
)
|
||||
create_file(package_path / "__init__.py", init_py)
|
||||
msg.good(f"Successfully created package '{model_name_v}'", main_path)
|
||||
msg.good(f"Successfully created package directory '{model_name_v}'", main_path)
|
||||
if create_sdist:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "sdist"], capture=False)
|
||||
|
@ -171,8 +190,14 @@ def package(
|
|||
if create_wheel:
|
||||
with util.working_dir(main_path):
|
||||
util.run_command([sys.executable, "setup.py", "bdist_wheel"], capture=False)
|
||||
wheel = main_path / "dist" / f"{model_name_v}{WHEEL_SUFFIX}"
|
||||
wheel_name_squashed = re.sub("_+", "_", model_name_v)
|
||||
wheel = main_path / "dist" / f"{wheel_name_squashed}{WHEEL_SUFFIX}"
|
||||
msg.good(f"Successfully created binary wheel", wheel)
|
||||
if "__" in model_name:
|
||||
msg.warn(
|
||||
f"Model name ('{model_name}') contains a run of underscores. "
|
||||
"Runs of underscores are not significant in installed package names.",
|
||||
)
|
||||
|
||||
|
||||
def has_wheel() -> bool:
|
||||
|
@ -422,6 +447,14 @@ def _format_label_scheme(data: Dict[str, Any]) -> str:
|
|||
return md.text
|
||||
|
||||
|
||||
def _is_permitted_package_name(package_name: str) -> bool:
|
||||
# regex from: https://www.python.org/dev/peps/pep-0426/#name
|
||||
permitted_match = re.search(
|
||||
r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
|
||||
)
|
||||
return permitted_match is not None
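The two name checks and the underscore squashing added to `package` can be tried in isolation. The sketch below reuses the regular expression and the `re.sub` call from the diff; the wrapper names `is_valid_model_name` and `squash_underscores` and the sample names are hypothetical and only serve the illustration.

```python
import re


def is_permitted_package_name(package_name: str) -> bool:
    # Same pattern as in the diff: starts and ends with a letter or digit,
    # with letters, digits, ".", "_" and "-" allowed in between.
    permitted_match = re.search(
        r"^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$", package_name, re.IGNORECASE
    )
    return permitted_match is not None


def is_valid_model_name(name: str) -> bool:
    # Hypothetical helper: a name passes only if it is both a valid Python
    # identifier (importable as a module) and a permitted package name.
    return name.isidentifier() and is_permitted_package_name(name)


def squash_underscores(model_name_v: str) -> str:
    # Runs of underscores collapse to one, matching the wheel file name lookup.
    return re.sub("_+", "_", model_name_v)


for candidate in ["my_model", "my-model", "_model", "model_"]:
    print(candidate, is_valid_model_name(candidate))
print(squash_underscores("en_my__model-0.0.0"))
```

Only `my_model` passes both checks here: `my-model` is not an identifier, and `_model` and `model_` are identifiers but start or end with a character the package-name pattern does not allow.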
|
||||
|
||||
|
||||
TEMPLATE_SETUP = """
|
||||
#!/usr/bin/env python
|
||||
import io
|
||||
|
|
|
@ -3,6 +3,7 @@ the docs and the init config command. It encodes various best practices and
|
|||
can help generate the best possible configuration, given a user's requirements. #}
|
||||
{%- set use_transformer = hardware != "cpu" -%}
|
||||
{%- set transformer = transformer_data[optimize] if use_transformer else {} -%}
|
||||
{%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "spancat", "trainable_lemmatizer"] -%}
|
||||
[paths]
|
||||
train = null
|
||||
dev = null
|
||||
|
@ -24,10 +25,10 @@ lang = "{{ lang }}"
|
|||
{%- set has_textcat = ("textcat" in components or "textcat_multilabel" in components) -%}
|
||||
{%- set with_accuracy = optimize == "accuracy" -%}
|
||||
{%- set has_accurate_textcat = has_textcat and with_accuracy -%}
|
||||
{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "entity_linker" in components or has_accurate_textcat) -%}
|
||||
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components %}
|
||||
{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "spancat" in components or "trainable_lemmatizer" in components or "entity_linker" in components or has_accurate_textcat) -%}
|
||||
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components -%}
|
||||
{%- else -%}
|
||||
{%- set full_pipeline = components %}
|
||||
{%- set full_pipeline = components -%}
|
||||
{%- endif %}
|
||||
pipeline = {{ full_pipeline|pprint()|replace("'", '"')|safe }}
|
||||
batch_size = {{ 128 if hardware == "gpu" else 1000 }}
|
||||
|
@ -54,7 +55,7 @@ stride = 96
|
|||
factory = "morphologizer"
|
||||
|
||||
[components.morphologizer.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
|
||||
[components.morphologizer.model.tok2vec]
|
||||
|
@ -70,7 +71,7 @@ grad_factor = 1.0
|
|||
factory = "tagger"
|
||||
|
||||
[components.tagger.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
|
||||
[components.tagger.model.tok2vec]
|
||||
|
@ -123,6 +124,60 @@ grad_factor = 1.0
|
|||
@layers = "reduce_mean.v1"
|
||||
{% endif -%}
|
||||
|
||||
{% if "spancat" in components -%}
|
||||
[components.spancat]
|
||||
factory = "spancat"
|
||||
max_positive = null
|
||||
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
|
||||
spans_key = "sc"
|
||||
threshold = 0.5
|
||||
|
||||
[components.spancat.model]
|
||||
@architectures = "spacy.SpanCategorizer.v1"
|
||||
|
||||
[components.spancat.model.reducer]
|
||||
@layers = "spacy.mean_max_reducer.v1"
|
||||
hidden_size = 128
|
||||
|
||||
[components.spancat.model.scorer]
|
||||
@layers = "spacy.LinearLogistic.v1"
|
||||
nO = null
|
||||
nI = null
|
||||
|
||||
[components.spancat.model.tok2vec]
|
||||
@architectures = "spacy-transformers.TransformerListener.v1"
|
||||
grad_factor = 1.0
|
||||
|
||||
[components.spancat.model.tok2vec.pooling]
|
||||
@layers = "reduce_mean.v1"
|
||||
|
||||
[components.spancat.suggester]
|
||||
@misc = "spacy.ngram_suggester.v1"
|
||||
sizes = [1,2,3]
|
||||
{% endif -%}
|
||||
|
||||
{% if "trainable_lemmatizer" in components -%}
|
||||
[components.trainable_lemmatizer]
|
||||
factory = "trainable_lemmatizer"
|
||||
backoff = "orth"
|
||||
min_tree_freq = 3
|
||||
overwrite = false
|
||||
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
|
||||
top_k = 1
|
||||
|
||||
[components.trainable_lemmatizer.model]
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
normalize = false
|
||||
|
||||
[components.trainable_lemmatizer.model.tok2vec]
|
||||
@architectures = "spacy-transformers.TransformerListener.v1"
|
||||
grad_factor = 1.0
|
||||
|
||||
[components.trainable_lemmatizer.model.tok2vec.pooling]
|
||||
@layers = "reduce_mean.v1"
|
||||
{% endif -%}
|
||||
|
||||
{% if "entity_linker" in components -%}
|
||||
[components.entity_linker]
|
||||
factory = "entity_linker"
|
||||
|
@ -131,7 +186,7 @@ incl_context = true
|
|||
incl_prior = true
|
||||
|
||||
[components.entity_linker.model]
|
||||
@architectures = "spacy.EntityLinker.v1"
|
||||
@architectures = "spacy.EntityLinker.v2"
|
||||
nO = null
|
||||
|
||||
[components.entity_linker.model.tok2vec]
|
||||
|
@ -238,7 +293,7 @@ maxout_pieces = 3
|
|||
factory = "morphologizer"
|
||||
|
||||
[components.morphologizer.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
|
||||
[components.morphologizer.model.tok2vec]
|
||||
|
@ -251,7 +306,7 @@ width = ${components.tok2vec.model.encode.width}
|
|||
factory = "tagger"
|
||||
|
||||
[components.tagger.model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
|
||||
[components.tagger.model.tok2vec]
|
||||
|
@ -295,6 +350,54 @@ nO = null
|
|||
width = ${components.tok2vec.model.encode.width}
|
||||
{% endif %}
|
||||
|
||||
{% if "spancat" in components %}
|
||||
[components.spancat]
|
||||
factory = "spancat"
|
||||
max_positive = null
|
||||
scorer = {"@scorers":"spacy.spancat_scorer.v1"}
|
||||
spans_key = "sc"
|
||||
threshold = 0.5
|
||||
|
||||
[components.spancat.model]
|
||||
@architectures = "spacy.SpanCategorizer.v1"
|
||||
|
||||
[components.spancat.model.reducer]
|
||||
@layers = "spacy.mean_max_reducer.v1"
|
||||
hidden_size = 128
|
||||
|
||||
[components.spancat.model.scorer]
|
||||
@layers = "spacy.LinearLogistic.v1"
|
||||
nO = null
|
||||
nI = null
|
||||
|
||||
[components.spancat.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecListener.v1"
|
||||
width = ${components.tok2vec.model.encode.width}
|
||||
|
||||
[components.spancat.suggester]
|
||||
@misc = "spacy.ngram_suggester.v1"
|
||||
sizes = [1,2,3]
|
||||
{% endif %}
|
||||
|
||||
{% if "trainable_lemmatizer" in components -%}
|
||||
[components.trainable_lemmatizer]
|
||||
factory = "trainable_lemmatizer"
|
||||
backoff = "orth"
|
||||
min_tree_freq = 3
|
||||
overwrite = false
|
||||
scorer = {"@scorers":"spacy.lemmatizer_scorer.v1"}
|
||||
top_k = 1
|
||||
|
||||
[components.trainable_lemmatizer.model]
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
nO = null
|
||||
normalize = false
|
||||
|
||||
[components.trainable_lemmatizer.model.tok2vec]
|
||||
@architectures = "spacy.Tok2VecListener.v1"
|
||||
width = ${components.tok2vec.model.encode.width}
|
||||
{% endif -%}
|
||||
|
||||
{% if "entity_linker" in components -%}
|
||||
[components.entity_linker]
|
||||
factory = "entity_linker"
|
||||
|
@ -303,7 +406,7 @@ incl_context = true
|
|||
incl_prior = true
|
||||
|
||||
[components.entity_linker.model]
|
||||
@architectures = "spacy.EntityLinker.v1"
|
||||
@architectures = "spacy.EntityLinker.v2"
|
||||
nO = null
|
||||
|
||||
[components.entity_linker.model.tok2vec]
|
||||
|
@ -369,7 +472,7 @@ no_output_layer = false
|
|||
{% endif %}
|
||||
|
||||
{% for pipe in components %}
|
||||
{% if pipe not in ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker"] %}
|
||||
{% if pipe not in listener_components %}
|
||||
{# Other components defined by the user: we just assume they're factories #}
|
||||
[components.{{ pipe }}]
|
||||
factory = "{{ pipe }}"
|
||||
|
|
|
@ -7,7 +7,7 @@ USAGE: https://spacy.io/usage/visualizers
|
|||
from typing import Union, Iterable, Optional, Dict, Any, Callable
|
||||
import warnings
|
||||
|
||||
from .render import DependencyRenderer, EntityRenderer
|
||||
from .render import DependencyRenderer, EntityRenderer, SpanRenderer
|
||||
from ..tokens import Doc, Span
|
||||
from ..errors import Errors, Warnings
|
||||
from ..util import is_in_jupyter
|
||||
|
@ -44,6 +44,7 @@ def render(
|
|||
factories = {
|
||||
"dep": (DependencyRenderer, parse_deps),
|
||||
"ent": (EntityRenderer, parse_ents),
|
||||
"span": (SpanRenderer, parse_spans),
|
||||
}
|
||||
if style not in factories:
|
||||
raise ValueError(Errors.E087.format(style=style))
|
||||
|
@ -55,6 +56,10 @@ def render(
|
|||
renderer_func, converter = factories[style]
|
||||
renderer = renderer_func(options=options)
|
||||
parsed = [converter(doc, options) for doc in docs] if not manual else docs # type: ignore
|
||||
if manual:
|
||||
for doc in docs:
|
||||
if isinstance(doc, dict) and "ents" in doc:
|
||||
doc["ents"] = sorted(doc["ents"], key=lambda x: (x["start"], x["end"]))
|
||||
_html["parsed"] = renderer.render(parsed, page=page, minify=minify).strip() # type: ignore
|
||||
html = _html["parsed"]
|
||||
if RENDER_WRAPPER is not None:
|
||||
|
@ -203,6 +208,42 @@ def parse_ents(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
|||
return {"text": doc.text, "ents": ents, "title": title, "settings": settings}
|
||||
|
||||
|
||||
def parse_spans(doc: Doc, options: Dict[str, Any] = {}) -> Dict[str, Any]:
|
||||
"""Generate spans in [{start: i, end: i, label: 'label'}] format.
|
||||
|
||||
doc (Doc): Document to parse.
|
||||
options (Dict[str, Any]): Span-specific visualisation options.
|
||||
RETURNS (dict): Generated span types keyed by text (original text) and spans.
|
||||
"""
|
||||
kb_url_template = options.get("kb_url_template", None)
|
||||
spans_key = options.get("spans_key", "sc")
|
||||
spans = [
|
||||
{
|
||||
"start": span.start_char,
|
||||
"end": span.end_char,
|
||||
"start_token": span.start,
|
||||
"end_token": span.end,
|
||||
"label": span.label_,
|
||||
"kb_id": span.kb_id_ if span.kb_id_ else "",
|
||||
"kb_url": kb_url_template.format(span.kb_id_) if kb_url_template else "#",
|
||||
}
|
||||
for span in doc.spans[spans_key]
|
||||
]
|
||||
tokens = [token.text for token in doc]
|
||||
|
||||
if not spans:
|
||||
warnings.warn(Warnings.W117.format(spans_key=spans_key))
|
||||
title = doc.user_data.get("title", None) if hasattr(doc, "user_data") else None
|
||||
settings = get_doc_settings(doc)
|
||||
return {
|
||||
"text": doc.text,
|
||||
"spans": spans,
|
||||
"title": title,
|
||||
"settings": settings,
|
||||
"tokens": tokens,
|
||||
}
|
||||
|
||||
|
||||
def set_render_wrapper(func: Callable[[str], str]) -> None:
|
||||
"""Set an optional wrapper function that is called around the generated
|
||||
HTML markup on displacy.render. This can be used to allow integration into
|
||||
|
|
|
@ -1,12 +1,15 @@
|
|||
from typing import Dict, Any, List, Optional, Union
|
||||
from typing import Any, Dict, List, Optional, Tuple, Union
|
||||
import uuid
|
||||
import itertools
|
||||
|
||||
from .templates import TPL_DEP_SVG, TPL_DEP_WORDS, TPL_DEP_WORDS_LEMMA, TPL_DEP_ARCS
|
||||
from .templates import TPL_ENT, TPL_ENT_RTL, TPL_FIGURE, TPL_TITLE, TPL_PAGE
|
||||
from .templates import TPL_ENTS, TPL_KB_LINK
|
||||
from ..util import minify_html, escape_html, registry
|
||||
from ..errors import Errors
|
||||
|
||||
from ..util import escape_html, minify_html, registry
|
||||
from .templates import TPL_DEP_ARCS, TPL_DEP_SVG, TPL_DEP_WORDS
|
||||
from .templates import TPL_DEP_WORDS_LEMMA, TPL_ENT, TPL_ENT_RTL, TPL_ENTS
|
||||
from .templates import TPL_FIGURE, TPL_KB_LINK, TPL_PAGE, TPL_SPAN
|
||||
from .templates import TPL_SPAN_RTL, TPL_SPAN_SLICE, TPL_SPAN_SLICE_RTL
|
||||
from .templates import TPL_SPAN_START, TPL_SPAN_START_RTL, TPL_SPANS
|
||||
from .templates import TPL_TITLE
|
||||
|
||||
DEFAULT_LANG = "en"
|
||||
DEFAULT_DIR = "ltr"
|
||||
|
@ -33,6 +36,168 @@ DEFAULT_LABEL_COLORS = {
|
|||
}
|
||||
|
||||
|
||||
class SpanRenderer:
|
||||
"""Render Spans as SVGs."""
|
||||
|
||||
style = "span"
|
||||
|
||||
def __init__(self, options: Dict[str, Any] = {}) -> None:
|
||||
"""Initialise span renderer
|
||||
|
||||
options (dict): Visualiser-specific options (colors, spans)
|
||||
"""
|
||||
# Set up the colors and overall look
|
||||
colors = dict(DEFAULT_LABEL_COLORS)
|
||||
user_colors = registry.displacy_colors.get_all()
|
||||
for user_color in user_colors.values():
|
||||
if callable(user_color):
|
||||
# Since this comes from the function registry, we want to make
|
||||
# sure we support functions that *return* a dict of colors
|
||||
user_color = user_color()
|
||||
if not isinstance(user_color, dict):
|
||||
raise ValueError(Errors.E925.format(obj=type(user_color)))
|
||||
colors.update(user_color)
|
||||
colors.update(options.get("colors", {}))
|
||||
self.default_color = DEFAULT_ENTITY_COLOR
|
||||
self.colors = {label.upper(): color for label, color in colors.items()}
|
||||
|
||||
# Set up how the text and labels will be rendered
|
||||
self.direction = DEFAULT_DIR
|
||||
self.lang = DEFAULT_LANG
|
||||
self.top_offset = options.get("top_offset", 40)
|
||||
self.top_offset_step = options.get("top_offset_step", 17)
|
||||
|
||||
# Set up which templates will be used
|
||||
template = options.get("template")
|
||||
if template:
|
||||
self.span_template = template["span"]
|
||||
self.span_slice_template = template["slice"]
|
||||
self.span_start_template = template["start"]
|
||||
else:
|
||||
if self.direction == "rtl":
|
||||
self.span_template = TPL_SPAN_RTL
|
||||
self.span_slice_template = TPL_SPAN_SLICE_RTL
|
||||
self.span_start_template = TPL_SPAN_START_RTL
|
||||
else:
|
||||
self.span_template = TPL_SPAN
|
||||
self.span_slice_template = TPL_SPAN_SLICE
|
||||
self.span_start_template = TPL_SPAN_START
|
||||
|
||||
def render(
|
||||
self, parsed: List[Dict[str, Any]], page: bool = False, minify: bool = False
|
||||
) -> str:
|
||||
"""Render complete markup.
|
||||
|
||||
parsed (list): Parsed span data to render.
|
||||
page (bool): Render parses wrapped as full HTML page.
|
||||
minify (bool): Minify HTML markup.
|
||||
RETURNS (str): Rendered HTML markup.
|
||||
"""
|
||||
rendered = []
|
||||
for i, p in enumerate(parsed):
|
||||
if i == 0:
|
||||
settings = p.get("settings", {})
|
||||
self.direction = settings.get("direction", DEFAULT_DIR)
|
||||
self.lang = settings.get("lang", DEFAULT_LANG)
|
||||
rendered.append(self.render_spans(p["tokens"], p["spans"], p.get("title")))
|
||||
|
||||
if page:
|
||||
docs = "".join([TPL_FIGURE.format(content=doc) for doc in rendered])
|
||||
markup = TPL_PAGE.format(content=docs, lang=self.lang, dir=self.direction)
|
||||
else:
|
||||
markup = "".join(rendered)
|
||||
if minify:
|
||||
return minify_html(markup)
|
||||
return markup
|
||||
|
||||
def render_spans(
|
||||
self,
|
||||
tokens: List[str],
|
||||
spans: List[Dict[str, Any]],
|
||||
title: Optional[str],
|
||||
) -> str:
|
||||
"""Render span types in text.
|
||||
|
||||
Spans are rendered per-token: for each token, we check if it's part
|
||||
of a span slice (a member of a span type) or a span start (the starting token of a
|
||||
given span type).
|
||||
|
||||
tokens (list): Individual tokens in the text
|
||||
spans (list): Individual entity spans and their start, end, label, kb_id and kb_url.
|
||||
title (str / None): Document title set in Doc.user_data['title'].
|
||||
"""
|
||||
per_token_info = []
|
||||
for idx, token in enumerate(tokens):
|
||||
# Identify if a token belongs to a Span (and which) and if it's a
|
||||
# start token of said Span. We'll use this for the final HTML render
|
||||
token_markup: Dict[str, Any] = {}
|
||||
token_markup["text"] = token
|
||||
entities = []
|
||||
for span in spans:
|
||||
ent = {}
|
||||
if span["start_token"] <= idx < span["end_token"]:
|
||||
ent["label"] = span["label"]
|
||||
ent["is_start"] = True if idx == span["start_token"] else False
|
||||
kb_id = span.get("kb_id", "")
|
||||
kb_url = span.get("kb_url", "#")
|
||||
ent["kb_link"] = (
|
||||
TPL_KB_LINK.format(kb_id=kb_id, kb_url=kb_url) if kb_id else ""
|
||||
)
|
||||
entities.append(ent)
|
||||
token_markup["entities"] = entities
|
||||
per_token_info.append(token_markup)
|
||||
|
||||
markup = self._render_markup(per_token_info)
|
||||
markup = TPL_SPANS.format(content=markup, dir=self.direction)
|
||||
if title:
|
||||
markup = TPL_TITLE.format(title=title) + markup
|
||||
return markup
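The per-token bookkeeping in `render_spans` can be illustrated without spaCy or the HTML templates. This is a simplified sketch that keeps only the membership test and the `is_start` flag and leaves out the `kb_link` handling; the function name `per_token_spans` and the sample tokens and spans are hypothetical.

```python
from typing import Any, Dict, List


def per_token_spans(
    tokens: List[str], spans: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """For each token, collect the spans covering it and whether it starts them."""
    per_token_info = []
    for idx, token in enumerate(tokens):
        entities = []
        for span in spans:
            # end_token is exclusive, as with Span.end
            if span["start_token"] <= idx < span["end_token"]:
                entities.append(
                    {"label": span["label"], "is_start": idx == span["start_token"]}
                )
        per_token_info.append({"text": token, "entities": entities})
    return per_token_info


tokens = ["Welcome", "to", "the", "Bank", "of", "China"]
spans = [
    {"start_token": 3, "end_token": 6, "label": "ORG"},
    {"start_token": 5, "end_token": 6, "label": "GPE"},
]
for info in per_token_spans(tokens, spans):
    print(info)
```

Because the check runs over every span for every token, overlapping spans are supported: the last token above belongs to both the `ORG` and the `GPE` span.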
|
||||
|
||||
def _render_markup(self, per_token_info: List[Dict[str, Any]]) -> str:
|
||||
"""Render the markup from per-token information"""
|
||||
markup = ""
|
||||
for token in per_token_info:
|
||||
entities = sorted(token["entities"], key=lambda d: d["label"])
|
||||
if entities:
|
||||
slices = self._get_span_slices(token["entities"])
|
||||
starts = self._get_span_starts(token["entities"])
|
||||
markup += self.span_template.format(
|
||||
text=token["text"], span_slices=slices, span_starts=starts
|
||||
)
|
||||
else:
|
||||
markup += escape_html(token["text"] + " ")
|
||||
return markup
|
||||
|
||||
def _get_span_slices(self, entities: List[Dict]) -> str:
|
||||
"""Get the rendered markup of all Span slices"""
|
||||
span_slices = []
|
||||
for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
|
||||
color = self.colors.get(entity["label"].upper(), self.default_color)
|
||||
span_slice = self.span_slice_template.format(
|
||||
bg=color, top_offset=self.top_offset + step
|
||||
)
|
||||
span_slices.append(span_slice)
|
||||
return "".join(span_slices)
|
||||
|
||||
def _get_span_starts(self, entities: List[Dict]) -> str:
|
||||
"""Get the rendered markup of all Span start tokens"""
|
||||
span_starts = []
|
||||
for entity, step in zip(entities, itertools.count(step=self.top_offset_step)):
|
||||
color = self.colors.get(entity["label"].upper(), self.default_color)
|
||||
span_start = (
|
||||
self.span_start_template.format(
|
||||
bg=color,
|
||||
top_offset=self.top_offset + step,
|
||||
label=entity["label"],
|
||||
kb_link=entity["kb_link"],
|
||||
)
|
||||
if entity["is_start"]
|
||||
else ""
|
||||
)
|
||||
span_starts.append(span_start)
|
||||
return "".join(span_starts)
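Both helpers above stack overlapping spans vertically by pairing each entity with a value from `itertools.count(step=self.top_offset_step)`. A small sketch of that offset arithmetic, using the defaults from `__init__` (`top_offset=40`, `top_offset_step=17`); the function name `stacked_offsets` is hypothetical.

```python
import itertools
from typing import List


def stacked_offsets(
    n_spans: int, top_offset: int = 40, top_offset_step: int = 17
) -> List[int]:
    """Pixel offsets for n_spans span lines drawn under the same token."""
    # itertools.count(step=...) starts at 0, so the first span sits at top_offset.
    steps = itertools.count(step=top_offset_step)
    return [top_offset + step for _, step in zip(range(n_spans), steps)]


print(stacked_offsets(3))  # [40, 57, 74]
```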
|
||||
|
||||
|
||||
class DependencyRenderer:
|
||||
"""Render dependency parses as SVGs."""
|
||||
|
||||
|
@ -105,7 +270,7 @@ class DependencyRenderer:
|
|||
RETURNS (str): Rendered SVG markup.
|
||||
"""
|
||||
self.levels = self.get_levels(arcs)
|
||||
self.highest_level = len(self.levels)
|
||||
self.highest_level = max(self.levels.values(), default=0)
|
||||
self.offset_y = self.distance / 2 * self.highest_level + self.arrow_stroke
|
||||
self.width = self.offset_x + len(words) * self.distance
|
||||
self.height = self.offset_y + 3 * self.word_spacing
|
||||
|
@ -165,7 +330,7 @@ class DependencyRenderer:
|
|||
if start < 0 or end < 0:
|
||||
error_args = dict(start=start, end=end, label=label, dir=direction)
|
||||
raise ValueError(Errors.E157.format(**error_args))
|
||||
level = self.levels.index(end - start) + 1
|
||||
level = self.levels[(start, end, label)]
|
||||
x_start = self.offset_x + start * self.distance + self.arrow_spacing
|
||||
if self.direction == "rtl":
|
||||
x_start = self.width - x_start
|
||||
|
@ -181,7 +346,7 @@ class DependencyRenderer:
|
|||
y_curve = self.offset_y - level * self.distance / 2
|
||||
if self.compact:
|
||||
y_curve = self.offset_y - level * self.distance / 6
|
||||
if y_curve == 0 and len(self.levels) > 5:
|
||||
if y_curve == 0 and max(self.levels.values(), default=0) > 5:
|
||||
y_curve = -self.distance
|
||||
arrowhead = self.get_arrowhead(direction, x_start, y, x_end)
|
||||
arc = self.get_arc(x_start, y, y_curve, x_end)
|
||||
|
@ -225,15 +390,23 @@ class DependencyRenderer:
|
|||
p1, p2, p3 = (end, end + self.arrow_width - 2, end - self.arrow_width + 2)
|
||||
return f"M{p1},{y + 2} L{p2},{y - self.arrow_width} {p3},{y - self.arrow_width}"
|
||||
|
||||
def get_levels(self, arcs: List[Dict[str, Any]]) -> List[int]:
|
||||
def get_levels(self, arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
|
||||
"""Calculate available arc height "levels".
|
||||
Used to calculate arrow heights dynamically and without wasting space.
|
||||
|
||||
arcs (list): Individual arcs and their start, end, direction and label.
|
||||
RETURNS (list): Arc levels sorted from lowest to highest.
|
||||
RETURNS (dict): Arc levels keyed by (start, end, label).
|
||||
"""
|
||||
levels = set(map(lambda arc: arc["end"] - arc["start"], arcs))
|
||||
return sorted(list(levels))
|
||||
arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
|
||||
length = max([arc["end"] for arc in arcs], default=0)
|
||||
max_level = [0] * length
|
||||
levels = {}
|
||||
for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
|
||||
level = max(max_level[arc["start"] : arc["end"]]) + 1
|
||||
for i in range(arc["start"], arc["end"]):
|
||||
max_level[i] = level
|
||||
levels[(arc["start"], arc["end"], arc["label"])] = level
|
||||
return levels
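The new `get_levels` logic can be run as a standalone function to see how levels are assigned. The body below follows the added lines in the diff; only the free-function form, the name `arc_levels` and the sample arcs are assumptions made for illustration.

```python
from typing import Any, Dict, List, Tuple


def arc_levels(arcs: List[Dict[str, Any]]) -> Dict[Tuple[int, int, str], int]:
    """Assign each arc the lowest level that clears all shorter arcs beneath it."""
    # Drop exact duplicates so a repeated arc is not stacked on top of itself.
    arcs = [dict(t) for t in {tuple(sorted(arc.items())) for arc in arcs}]
    length = max([arc["end"] for arc in arcs], default=0)
    max_level = [0] * length  # highest level used so far over each token gap
    levels = {}
    # Shorter arcs are placed first, so longer arcs end up above them.
    for arc in sorted(arcs, key=lambda arc: arc["end"] - arc["start"]):
        level = max(max_level[arc["start"] : arc["end"]]) + 1
        for i in range(arc["start"], arc["end"]):
            max_level[i] = level
        levels[(arc["start"], arc["end"], arc["label"])] = level
    return levels


arcs = [
    {"start": 0, "end": 1, "label": "det", "dir": "left"},
    {"start": 1, "end": 2, "label": "nsubj", "dir": "left"},
    {"start": 0, "end": 2, "label": "dep", "dir": "right"},
    {"start": 0, "end": 1, "label": "det", "dir": "left"},  # exact duplicate
]
print(arc_levels(arcs))
```

The two short arcs do not overlap and both get level 1, while the arc spanning both is placed at level 2. Under the replaced implementation, the level depended only on the arc length, so arcs far apart in the sentence could be pushed higher than needed.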
|
||||
|
||||
|
||||
class EntityRenderer:
|
||||
|
@ -242,7 +415,7 @@ class EntityRenderer:
|
|||
style = "ent"
|
||||
|
||||
def __init__(self, options: Dict[str, Any] = {}) -> None:
|
||||
"""Initialise dependency renderer.
|
||||
"""Initialise entity renderer.
|
||||
|
||||
options (dict): Visualiser-specific options (colors, ents)
|
||||
"""
|
||||
|
|
|
@ -62,6 +62,55 @@ TPL_ENT_RTL = """
|
|||
</mark>
|
||||
"""
|
||||
|
||||
TPL_SPANS = """
|
||||
<div class="spans" style="line-height: 2.5; direction: {dir}">{content}</div>
|
||||
"""
|
||||
|
||||
TPL_SPAN = """
|
||||
<span style="font-weight: bold; display: inline-block; position: relative;">
|
||||
{text}
|
||||
{span_slices}
|
||||
{span_starts}
|
||||
</span>
|
||||
"""
|
||||
|
||||
TPL_SPAN_SLICE = """
|
||||
<span style="background: {bg}; top: {top_offset}px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;">
|
||||
</span>
|
||||
"""
|
||||
|
||||
|
||||
TPL_SPAN_START = """
|
||||
<span style="background: {bg}; top: {top_offset}px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;">
|
||||
<span style="background: {bg}; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px">
|
||||
{label}{kb_link}
|
||||
</span>
|
||||
</span>
|
||||
|
||||
"""
|
||||
|
||||
TPL_SPAN_RTL = """
|
||||
<span style="font-weight: bold; display: inline-block; position: relative;">
|
||||
{text}
|
||||
{span_slices}
|
||||
{span_starts}
|
||||
</span>
|
||||
"""
|
||||
|
||||
TPL_SPAN_SLICE_RTL = """
|
||||
<span style="background: {bg}; top: {top_offset}px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;">
|
||||
</span>
|
||||
"""
|
||||
|
||||
TPL_SPAN_START_RTL = """
|
||||
<span style="background: {bg}; top: {top_offset}px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;">
|
||||
<span style="background: {bg}; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px">
|
||||
{label}{kb_link}
|
||||
</span>
|
||||
</span>
|
||||
"""
|
||||
|
||||
|
||||
# Important: this needs to start with a space!
|
||||
TPL_KB_LINK = """
|
||||
<a style="text-decoration: none; color: inherit; font-weight: normal" href="{kb_url}">{kb_id}</a>
|
||||
|
|
|
@ -192,6 +192,13 @@ class Warnings(metaclass=ErrorsWithCodes):
|
|||
W115 = ("Skipping {method}: the floret vector table cannot be modified. "
|
||||
"Vectors are calculated from character ngrams.")
|
||||
W116 = ("Unable to clean attribute '{attr}'.")
|
||||
W117 = ("No spans to visualize found in Doc object with spans_key: '{spans_key}'. If this is "
|
||||
"surprising to you, make sure the Doc was processed using a model "
|
||||
"that supports span categorization, and check the `doc.spans[spans_key]` "
|
||||
"property manually if necessary.")
|
||||
W118 = ("Term '{term}' not found in glossary. It may however be explained in documentation "
|
||||
"for the corpora used to train the language. Please check "
|
||||
"`nlp.meta[\"sources\"]` for any relevant links.")
|
||||
|
||||
|
||||
class Errors(metaclass=ErrorsWithCodes):
|
||||
|
@ -483,7 +490,7 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
"components, since spans are only views of the Doc. Use Doc and "
|
||||
"Token attributes (or custom extension attributes) only and remove "
|
||||
"the following: {attrs}")
|
||||
E181 = ("Received invalid attributes for unkown object {obj}: {attrs}. "
|
||||
E181 = ("Received invalid attributes for unknown object {obj}: {attrs}. "
|
||||
"Only Doc and Token attributes are supported.")
|
||||
E182 = ("Received invalid attribute declaration: {attr}\nDid you forget "
|
||||
"to define the attribute? For example: `{attr}.???`")
|
||||
|
@ -520,10 +527,14 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E202 = ("Unsupported {name} mode '{mode}'. Supported modes: {modes}.")
|
||||
|
||||
# New errors added in v3.x
|
||||
E855 = ("Invalid {obj}: {obj} is not from the same doc.")
|
||||
E856 = ("Error accessing span at position {i}: out of bounds in span group "
|
||||
"of length {length}.")
|
||||
E857 = ("Entry '{name}' not found in edit tree lemmatizer labels.")
|
||||
E858 = ("The {mode} vector table does not support this operation. "
|
||||
"{alternative}")
|
||||
E859 = ("The floret vector table cannot be modified.")
|
||||
E860 = ("Can't truncate fasttext-bloom vectors.")
|
||||
E860 = ("Can't truncate floret vectors.")
|
||||
E861 = ("No 'keys' should be provided when initializing floret vectors "
|
||||
"with 'minn' and 'maxn'.")
|
||||
E862 = ("'hash_count' must be between 1-4 for floret vectors.")
|
||||
|
@ -566,9 +577,6 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
E879 = ("Unexpected type for 'spans' data. Provide a dictionary mapping keys to "
|
||||
"a list of spans, with each span represented by a tuple (start_char, end_char). "
|
||||
"The tuple can be optionally extended with a label and a KB ID.")
|
||||
E880 = ("The 'wandb' library could not be found - did you install it? "
|
||||
"Alternatively, specify the 'ConsoleLogger' in the 'training.logger' "
|
||||
"config section, instead of the 'WandbLogger'.")
|
||||
E884 = ("The pipeline could not be initialized because the vectors "
|
||||
"could not be found at '{vectors}'. If your pipeline was already "
|
||||
"initialized/trained before, call 'resume_training' instead of 'initialize', "
|
||||
|
@ -894,6 +902,17 @@ class Errors(metaclass=ErrorsWithCodes):
|
|||
"patterns.")
|
||||
E1025 = ("Cannot intify the value '{value}' as an IOB string. The only "
|
||||
"supported values are: 'I', 'O', 'B' and ''")
|
||||
E1026 = ("Edit tree has an invalid format:\n{errors}")
|
||||
E1027 = ("AlignmentArray only supports slicing with a step of 1.")
|
||||
E1028 = ("AlignmentArray only supports indexing using an int or a slice.")
|
||||
E1029 = ("Edit tree cannot be applied to form.")
|
||||
E1030 = ("Edit tree identifier out of range.")
|
||||
E1031 = ("Could not find gold transition - see logs above.")
|
||||
E1032 = ("`{var}` should not be {forbidden}, but received {value}.")
|
||||
E1033 = ("Dimension {name} invalid -- only nO, nF, nP")
|
||||
E1034 = ("Node index {i} out of bounds ({length})")
|
||||
E1035 = ("Token index {i} out of bounds ({length})")
|
||||
E1036 = ("Cannot index into NoneNode")
|
||||
|
||||
|
||||
# Deprecated model shortcuts, only used in errors and warnings
|
||||
|
|
|
@ -1,3 +1,7 @@
|
|||
import warnings
|
||||
from .errors import Warnings
|
||||
|
||||
|
||||
def explain(term):
|
||||
"""Get a description for a given POS tag, dependency label or entity type.
|
||||
|
||||
|
@ -11,6 +15,8 @@ def explain(term):
|
|||
"""
|
||||
if term in GLOSSARY:
|
||||
return GLOSSARY[term]
|
||||
else:
|
||||
warnings.warn(Warnings.W118.format(term=term))
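A minimal sketch of the new fallback behaviour, assuming a tiny stand-in glossary and a shortened warning text in place of `Warnings.W118`. The lookup still returns `None` for unknown terms, but now also emits a warning.

```python
import warnings

# Stand-in for the real glossary; only one entry for illustration.
GLOSSARY = {"NOUN": "noun"}


def explain(term):
    """Return the description for a term, or warn and return None if unknown."""
    if term in GLOSSARY:
        return GLOSSARY[term]
    # Shortened, hypothetical message; the diff formats Warnings.W118 instead.
    warnings.warn(f"Term '{term}' not found in glossary.")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    print(explain("NOUN"))
    print(explain("XYZ"))
print(len(caught))
```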
|
||||
|
||||
|
||||
GLOSSARY = {
|
||||
|
@ -310,7 +316,6 @@ GLOSSARY = {
|
|||
"re": "repeated element",
|
||||
"rs": "reported speech",
|
||||
"sb": "subject",
|
||||
"sb": "subject",
|
||||
"sbp": "passivized subject (PP)",
|
||||
"sp": "subject or predicate",
|
||||
"svp": "separable verb prefix",
|
||||
|
|
16
spacy/lang/dsb/__init__.py
Normal file
|
@ -0,0 +1,16 @@
|
|||
from .lex_attrs import LEX_ATTRS
|
||||
from .stop_words import STOP_WORDS
|
||||
from ...language import Language, BaseDefaults
|
||||
|
||||
|
||||
class LowerSorbianDefaults(BaseDefaults):
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
|
||||
|
||||
class LowerSorbian(Language):
|
||||
lang = "dsb"
|
||||
Defaults = LowerSorbianDefaults
|
||||
|
||||
|
||||
__all__ = ["LowerSorbian"]
|
15
spacy/lang/dsb/examples.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.dsb.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"Z tym stwori so wuměnjenje a zakład za dalše wobdźěłanje přez analyzu tekstoweje struktury a semantisku anotaciju a z tym tež za tu předstajenu digitalnu online-wersiju.",
|
||||
"Mi so tu jara derje spodoba.",
|
||||
"Kotre nowniny chceće měć?",
|
||||
"Tak ako w slědnem lěśe jo teke lětosa jano doma zapustowaś móžno.",
|
||||
"Zwóstanjo pótakem hyšći wjele źěła.",
|
||||
]
|
113
spacy/lang/dsb/lex_attrs.py
Normal file
|
@ -0,0 +1,113 @@
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
_num_words = [
|
||||
"nul",
|
||||
"jaden",
|
||||
"jadna",
|
||||
"jadno",
|
||||
"dwa",
|
||||
"dwě",
|
||||
"tśi",
|
||||
"tśo",
|
||||
"styri",
|
||||
"styrjo",
|
||||
"pěś",
|
||||
"pěśo",
|
||||
"šesć",
|
||||
"šesćo",
|
||||
"sedym",
|
||||
"sedymjo",
|
||||
"wósym",
|
||||
"wósymjo",
|
||||
"źewjeś",
|
||||
"źewjeśo",
|
||||
"źaseś",
|
||||
"źaseśo",
|
||||
"jadnassćo",
|
||||
"dwanassćo",
|
||||
"tśinasćo",
|
||||
"styrnasćo",
|
||||
"pěśnasćo",
|
||||
"šesnasćo",
|
||||
"sedymnasćo",
|
||||
"wósymnasćo",
|
||||
"źewjeśnasćo",
|
||||
"dwanasćo",
|
||||
"dwaźasća",
|
||||
"tśiźasća",
|
||||
"styrźasća",
|
||||
"pěśźaset",
|
||||
"šesćźaset",
|
||||
"sedymźaset",
|
||||
"wósymźaset",
|
||||
"źewjeśźaset",
|
||||
"sto",
|
||||
"tysac",
|
||||
"milion",
|
||||
"miliarda",
|
||||
"bilion",
|
||||
"biliarda",
|
||||
"trilion",
|
||||
"triliarda",
|
||||
]
|
||||
|
||||
_ordinal_words = [
|
||||
"prědny",
|
||||
"prědna",
|
||||
"prědne",
|
||||
"drugi",
|
||||
"druga",
|
||||
"druge",
|
||||
"tśeśi",
|
||||
"tśeśa",
|
||||
"tśeśe",
|
||||
"stwórty",
|
||||
"stwórta",
|
||||
"stwórte",
|
||||
"pêty",
|
||||
"pěta",
|
||||
"pête",
|
||||
"šesty",
|
||||
"šesta",
|
||||
"šeste",
|
||||
"sedymy",
|
||||
"sedyma",
|
||||
"sedyme",
|
||||
"wósymy",
|
||||
"wósyma",
|
||||
"wósyme",
|
||||
"źewjety",
|
||||
"źewjeta",
|
||||
"źewjete",
|
||||
"źasety",
|
||||
"źaseta",
|
||||
"źasete",
|
||||
"jadnasty",
|
||||
"jadnasta",
|
||||
"jadnaste",
|
||||
"dwanasty",
|
||||
"dwanasta",
|
||||
"dwanaste",
|
||||
]
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
text_lower = text.lower()
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
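To make the order of checks in `like_num` easier to follow, here is a self-contained sketch of the same function. The word lists are trimmed to a handful of entries from the lists above purely for illustration; the logic is copied from the diff.

```python
# Trimmed copies of the lists above; the real module holds the full lists.
_num_words = ["nul", "jaden", "dwa", "tśi", "sto"]
_ordinal_words = ["prědny", "drugi", "tśeśi"]


def like_num(text):
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Simple fractions such as 3/4.
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    text_lower = text.lower()
    if text_lower in _num_words:
        return True
    # Check ordinal number
    if text_lower in _ordinal_words:
        return True
    return False


samples = ["1.000,5", "-12", "3/4", "Dwa", "drugi", "źěło", "1/2/3"]
checks = {t: like_num(t) for t in samples}
```

Because both word lists are plain Python lists, each lookup is a linear scan; that is probably fine at this size, though other languages in this commit wrap their word lists in `set(...)`.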
|
15
spacy/lang/dsb/stop_words.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
a abo aby ako ale až
|
||||
|
||||
daniž dokulaž
|
||||
|
||||
gaž
|
||||
|
||||
jolic
|
||||
|
||||
pak pótom
|
||||
|
||||
teke togodla
|
||||
""".split()
|
||||
)
|
|
@ -447,7 +447,6 @@ for exc_data in [
|
|||
{ORTH: "La.", NORM: "Louisiana"},
|
||||
{ORTH: "Mar.", NORM: "March"},
|
||||
{ORTH: "Mass.", NORM: "Massachusetts"},
|
||||
{ORTH: "May.", NORM: "May"},
|
||||
{ORTH: "Mich.", NORM: "Michigan"},
|
||||
{ORTH: "Minn.", NORM: "Minnesota"},
|
||||
{ORTH: "Miss.", NORM: "Mississippi"},
|
||||
|
|
|
@ -9,14 +9,14 @@ Example sentences to test spaCy and its language models.
|
|||
sentences = [
|
||||
"Apple está buscando comprar una startup del Reino Unido por mil millones de dólares.",
|
||||
"Los coches autónomos delegan la responsabilidad del seguro en sus fabricantes.",
|
||||
"San Francisco analiza prohibir los robots delivery.",
|
||||
"San Francisco analiza prohibir los robots de reparto.",
|
||||
"Londres es una gran ciudad del Reino Unido.",
|
||||
"El gato come pescado.",
|
||||
"Veo al hombre con el telescopio.",
|
||||
"La araña come moscas.",
|
||||
"El pingüino incuba en su nido sobre el hielo.",
|
||||
"¿Dónde estais?",
|
||||
"¿Quién es el presidente Francés?",
|
||||
"¿Dónde está encuentra la capital de Argentina?",
|
||||
"¿Dónde estáis?",
|
||||
"¿Quién es el presidente francés?",
|
||||
"¿Dónde se encuentra la capital de Argentina?",
|
||||
"¿Cuándo nació José de San Martín?",
|
||||
]
|
||||
|
|
|
@ -1,82 +1,80 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
actualmente acuerdo adelante ademas además adrede afirmó agregó ahi ahora ahí
|
||||
al algo alguna algunas alguno algunos algún alli allí alrededor ambos ampleamos
|
||||
antano antaño ante anterior antes apenas aproximadamente aquel aquella aquellas
|
||||
aquello aquellos aqui aquél aquélla aquéllas aquéllos aquí arriba arribaabajo
|
||||
aseguró asi así atras aun aunque ayer añadió aún
|
||||
a acuerdo adelante ademas además afirmó agregó ahi ahora ahí al algo alguna
|
||||
algunas alguno algunos algún alli allí alrededor ambos ante anterior antes
|
||||
apenas aproximadamente aquel aquella aquellas aquello aquellos aqui aquél
|
||||
aquélla aquéllas aquéllos aquí arriba aseguró asi así atras aun aunque añadió
|
||||
aún
|
||||
|
||||
bajo bastante bien breve buen buena buenas bueno buenos
|
||||
|
||||
cada casi cerca cierta ciertas cierto ciertos cinco claro comentó como con
|
||||
conmigo conocer conseguimos conseguir considera consideró consigo consigue
|
||||
consiguen consigues contigo contra cosas creo cual cuales cualquier cuando
|
||||
cuanta cuantas cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas
|
||||
cuánto cuántos cómo
|
||||
cada casi cierta ciertas cierto ciertos cinco claro comentó como con conmigo
|
||||
conocer conseguimos conseguir considera consideró consigo consigue consiguen
|
||||
consigues contigo contra creo cual cuales cualquier cuando cuanta cuantas
|
||||
cuanto cuantos cuatro cuenta cuál cuáles cuándo cuánta cuántas cuánto cuántos
|
||||
cómo
|
||||
|
||||
da dado dan dar de debajo debe deben debido decir dejó del delante demasiado
|
||||
demás dentro deprisa desde despacio despues después detras detrás dia dias dice
|
||||
dicen dicho dieron diferente diferentes dijeron dijo dio donde dos durante día
|
||||
días dónde
|
||||
dicen dicho dieron diez diferente diferentes dijeron dijo dio doce donde dos
|
||||
durante día días dónde
|
||||
|
||||
ejemplo el ella ellas ello ellos embargo empleais emplean emplear empleas
|
||||
empleo en encima encuentra enfrente enseguida entonces entre era eramos eran
|
||||
eras eres es esa esas ese eso esos esta estaba estaban estado estados estais
|
||||
estamos estan estar estará estas este esto estos estoy estuvo está están ex
|
||||
excepto existe existen explicó expresó él ésa ésas ése ésos ésta éstas éste
|
||||
éstos
|
||||
e el ella ellas ello ellos embargo en encima encuentra enfrente enseguida
|
||||
entonces entre era eramos eran eras eres es esa esas ese eso esos esta estaba
|
||||
estaban estado estados estais estamos estan estar estará estas este esto estos
|
||||
estoy estuvo está están excepto existe existen explicó expresó él ésa ésas ése
|
||||
ésos ésta éstas éste éstos
|
||||
|
||||
fin final fue fuera fueron fui fuimos
|
||||
|
||||
general gran grandes gueno
|
||||
gran grande grandes
|
||||
|
||||
ha haber habia habla hablan habrá había habían hace haceis hacemos hacen hacer
|
||||
hacerlo haces hacia haciendo hago han hasta hay haya he hecho hemos hicieron
|
||||
hizo horas hoy hubo
|
||||
hizo hoy hubo
|
||||
|
||||
igual incluso indicó informo informó intenta intentais intentamos intentan
|
||||
intentar intentas intento ir
|
||||
igual incluso indicó informo informó ir
|
||||
|
||||
junto
|
||||
|
||||
la lado largo las le lejos les llegó lleva llevar lo los luego lugar
|
||||
la lado largo las le les llegó lleva llevar lo los luego
|
||||
|
||||
mal manera manifestó mas mayor me mediante medio mejor mencionó menos menudo mi
|
||||
mia mias mientras mio mios mis misma mismas mismo mismos modo momento mucha
|
||||
muchas mucho muchos muy más mí mía mías mío míos
|
||||
mia mias mientras mio mios mis misma mismas mismo mismos modo mucha muchas
|
||||
mucho muchos muy más mí mía mías mío míos
|
||||
|
||||
nada nadie ni ninguna ningunas ninguno ningunos ningún no nos nosotras nosotros
|
||||
nuestra nuestras nuestro nuestros nueva nuevas nuevo nuevos nunca
|
||||
nuestra nuestras nuestro nuestros nueva nuevas nueve nuevo nuevos nunca
|
||||
|
||||
ocho os otra otras otro otros
|
||||
o ocho once os otra otras otro otros
|
||||
|
||||
pais para parece parte partir pasada pasado paìs peor pero pesar poca pocas
|
||||
poco pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
|
||||
para parece parte partir pasada pasado paìs peor pero pesar poca pocas poco
|
||||
pocos podeis podemos poder podria podriais podriamos podrian podrias podrá
|
||||
podrán podría podrían poner por porque posible primer primera primero primeros
|
||||
principalmente pronto propia propias propio propios proximo próximo próximos
|
||||
pudo pueda puede pueden puedo pues
|
||||
pronto propia propias propio propios proximo próximo próximos pudo pueda puede
|
||||
pueden puedo pues
|
||||
|
||||
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién quiénes qué
|
||||
qeu que quedó queremos quien quienes quiere quiza quizas quizá quizás quién
|
||||
quiénes qué
|
||||
|
||||
raras realizado realizar realizó repente respecto
|
||||
realizado realizar realizó repente respecto
|
||||
|
||||
sabe sabeis sabemos saben saber sabes salvo se sea sean segun segunda segundo
|
||||
según seis ser sera será serán sería señaló si sido siempre siendo siete sigue
|
||||
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy
|
||||
soyos su supuesto sus suya suyas suyo sé sí sólo
|
||||
siguiente sin sino sobre sois sola solamente solas solo solos somos son soy su
|
||||
supuesto sus suya suyas suyo suyos sé sí sólo
|
||||
|
||||
tal tambien también tampoco tan tanto tarde te temprano tendrá tendrán teneis
|
||||
tenemos tener tenga tengo tenido tenía tercera ti tiempo tiene tienen toda
|
||||
todas todavia todavía todo todos total trabaja trabajais trabajamos trabajan
|
||||
trabajar trabajas trabajo tras trata través tres tu tus tuvo tuya tuyas tuyo
|
||||
tuyos tú
|
||||
tenemos tener tenga tengo tenido tenía tercera tercero ti tiene tienen toda
|
||||
todas todavia todavía todo todos total tras trata través tres tu tus tuvo tuya
|
||||
tuyas tuyo tuyos tú
|
||||
|
||||
ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
|
||||
u ultimo un una unas uno unos usa usais usamos usan usar usas uso usted ustedes
|
||||
última últimas último últimos
|
||||
|
||||
va vais valor vamos van varias varios vaya veces ver verdad verdadera verdadero
|
||||
vez vosotras vosotros voy vuestra vuestras vuestro vuestros
|
||||
va vais vamos van varias varios vaya veces ver verdad verdadera verdadero vez
|
||||
vosotras vosotros voy vuestra vuestras vuestro vuestros
|
||||
|
||||
ya yo
|
||||
y ya yo
|
||||
""".split()
|
||||
)
|
||||
|
|
|
@ -3,7 +3,7 @@ from ...attrs import LIKE_NUM
|
|||
|
||||
_num_words = set(
|
||||
"""
|
||||
zero un deux trois quatre cinq six sept huit neuf dix
|
||||
zero un une deux trois quatre cinq six sept huit neuf dix
|
||||
onze douze treize quatorze quinze seize dix-sept dix-huit dix-neuf
|
||||
vingt trente quarante cinquante soixante soixante-dix septante quatre-vingt huitante quatre-vingt-dix nonante
|
||||
cent mille mil million milliard billion quadrillion quintillion
|
||||
|
@ -13,7 +13,7 @@ sextillion septillion octillion nonillion decillion
|
|||
|
||||
_ordinal_words = set(
|
||||
"""
|
||||
premier deuxième second troisième quatrième cinquième sixième septième huitième neuvième dixième
|
||||
premier première deuxième second seconde troisième quatrième cinquième sixième septième huitième neuvième dixième
|
||||
onzième douzième treizième quatorzième quinzième seizième dix-septième dix-huitième dix-neuvième
|
||||
vingtième trentième quarantième cinquantième soixantième soixante-dixième septantième quatre-vingtième huitantième quatre-vingt-dixième nonantième
|
||||
centième millième millionnième milliardième billionnième quadrillionnième quintillionnième
|
||||
|
|
|
@ -64,9 +64,7 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
|
|||
prev_end = right_end.i
|
||||
|
||||
left_index = word.left_edge.i
|
||||
left_index = (
|
||||
left_index + 1 if word.left_edge.pos == adp_pos else left_index
|
||||
)
|
||||
left_index = left_index + 1 if word.left_edge.pos == adp_pos else left_index
|
||||
|
||||
yield left_index, right_end.i + 1, np_label
|
||||
elif word.dep == conj_label:
|
||||
|
|
18
spacy/lang/hsb/__init__.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
from .lex_attrs import LEX_ATTRS
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tokenizer_exceptions import TOKENIZER_EXCEPTIONS
|
||||
from ...language import Language, BaseDefaults
|
||||
|
||||
|
||||
class UpperSorbianDefaults(BaseDefaults):
|
||||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
tokenizer_exceptions = TOKENIZER_EXCEPTIONS
|
||||
|
||||
|
||||
class UpperSorbian(Language):
|
||||
lang = "hsb"
|
||||
Defaults = UpperSorbianDefaults
|
||||
|
||||
|
||||
__all__ = ["UpperSorbian"]
|
15
spacy/lang/hsb/examples.py
Normal file
|
@ -0,0 +1,15 @@
|
|||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.hsb.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"To běšo wjelgin raźone a jo se wót luźi derje pśiwzeło. Tak som dožywiła wjelgin",
|
||||
"Jogo pśewóźowarce stej groniłej, až how w serbskich stronach njama Santa Claus nic pytaś.",
|
||||
"A ten sobuźěłaśeŕ Statneje biblioteki w Barlinju jo pśimjeł drogotne knigły bźez rukajcowu z nagima rukoma!",
|
||||
"Take wobchadanje z našym kulturnym derbstwom zewšym njejźo.",
|
||||
"Wopśimjeśe drugich pśinoskow jo było na wusokem niwowje, ako pśecej.",
|
||||
]
|
106
spacy/lang/hsb/lex_attrs.py
Normal file
|
@ -0,0 +1,106 @@
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
_num_words = [
|
||||
"nul",
|
||||
"jedyn",
|
||||
"jedna",
|
||||
"jedne",
|
||||
"dwaj",
|
||||
"dwě",
|
||||
"tři",
|
||||
"třo",
|
||||
"štyri",
|
||||
"štyrjo",
|
||||
"pjeć",
|
||||
"šěsć",
|
||||
"sydom",
|
||||
"wosom",
|
||||
"dźewjeć",
|
||||
"dźesać",
|
||||
"jědnaće",
|
||||
"dwanaće",
|
||||
"třinaće",
|
||||
"štyrnaće",
|
||||
"pjatnaće",
|
||||
"šěsnaće",
|
||||
"sydomnaće",
|
||||
"wosomnaće",
|
||||
"dźewjatnaće",
|
||||
"dwaceći",
|
||||
"třiceći",
|
||||
"štyrceći",
|
||||
"pjećdźesat",
|
||||
"šěsćdźesat",
|
||||
"sydomdźesat",
|
||||
"wosomdźesat",
|
||||
"dźewjećdźesat",
|
||||
"sto",
|
||||
"tysac",
|
||||
"milion",
|
||||
"miliarda",
|
||||
"bilion",
|
||||
"biliarda",
|
||||
"trilion",
|
||||
"triliarda",
|
||||
]
|
||||
|
||||
_ordinal_words = [
|
||||
"prěni",
|
||||
"prěnja",
|
||||
"prěnje",
|
||||
"druhi",
|
||||
"druha",
|
||||
"druhe",
|
||||
"třeći",
|
||||
"třeća",
|
||||
"třeće",
|
||||
"štwórty",
|
||||
"štwórta",
|
||||
"štwórte",
|
||||
"pjaty",
|
||||
"pjata",
|
||||
"pjate",
|
||||
"šěsty",
|
||||
"šěsta",
|
||||
"šěste",
|
||||
"sydmy",
|
||||
"sydma",
|
||||
"sydme",
|
||||
"wosmy",
|
||||
"wosma",
|
||||
"wosme",
|
||||
"dźewjaty",
|
||||
"dźewjata",
|
||||
"dźewjate",
|
||||
"dźesaty",
|
||||
"dźesata",
|
||||
"dźesate",
|
||||
"jědnaty",
|
||||
"jědnata",
|
||||
"jědnate",
|
||||
"dwanaty",
|
||||
"dwanata",
|
||||
"dwanate",
|
||||
]
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
if text.count("/") == 1:
|
||||
num, denom = text.split("/")
|
||||
if num.isdigit() and denom.isdigit():
|
||||
return True
|
||||
text_lower = text.lower()
|
||||
if text_lower in _num_words:
|
||||
return True
|
||||
# Check ordinal number
|
||||
if text_lower in _ordinal_words:
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
LEX_ATTRS = {LIKE_NUM: like_num}
|
19
spacy/lang/hsb/stop_words.py
Normal file
|
@ -0,0 +1,19 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
a abo ale ani
|
||||
|
||||
dokelž
|
||||
|
||||
hdyž
|
||||
|
||||
jeli jelizo
|
||||
|
||||
kaž
|
||||
|
||||
pak potom
|
||||
|
||||
tež tohodla
|
||||
|
||||
zo zoby
|
||||
""".split()
|
||||
)
|
18
spacy/lang/hsb/tokenizer_exceptions.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
||||
from ...symbols import ORTH, NORM
|
||||
from ...util import update_exc
|
||||
|
||||
_exc = dict()
|
||||
for exc_data in [
|
||||
{ORTH: "mil.", NORM: "milion"},
|
||||
{ORTH: "wob.", NORM: "wobydler"},
|
||||
]:
|
||||
_exc[exc_data[ORTH]] = [exc_data]
|
||||
|
||||
for orth in [
|
||||
"resp.",
|
||||
]:
|
||||
_exc[orth] = [{ORTH: orth}]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
|
|
@ -1,12 +1,13 @@
|
|||
from typing import Iterator, Any, Dict
|
||||
|
||||
from .punctuation import TOKENIZER_INFIXES
|
||||
from .stop_words import STOP_WORDS
|
||||
from .tag_map import TAG_MAP
|
||||
from .lex_attrs import LEX_ATTRS
|
||||
from ...language import Language, BaseDefaults
|
||||
from ...tokens import Doc
|
||||
from ...scorer import Scorer
|
||||
from ...symbols import POS
|
||||
from ...symbols import POS, X
|
||||
from ...training import validate_examples
|
||||
from ...util import DummyTokenizer, registry, load_config_from_str
|
||||
from ...vocab import Vocab
|
||||
|
@ -31,15 +32,24 @@ def create_tokenizer():
|
|||
class KoreanTokenizer(DummyTokenizer):
|
||||
def __init__(self, vocab: Vocab):
|
||||
self.vocab = vocab
|
||||
MeCab = try_mecab_import() # type: ignore[func-returns-value]
|
||||
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
|
||||
self._mecab = try_mecab_import() # type: ignore[func-returns-value]
|
||||
self._mecab_tokenizer = None
|
||||
|
||||
@property
|
||||
def mecab_tokenizer(self):
|
||||
# This is a property so that initializing a pipeline with blank:ko is
|
||||
# possible without actually requiring mecab-ko, e.g. to run
|
||||
# `spacy init vectors ko` for a pipeline that will have a different
|
||||
# tokenizer in the end. The languages need to match for the vectors
|
||||
# to be imported and there's no way to pass a custom config to
|
||||
# `init vectors`.
|
||||
if self._mecab_tokenizer is None:
|
||||
self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
|
||||
return self._mecab_tokenizer
|
||||
|
||||
def __reduce__(self):
|
||||
return KoreanTokenizer, (self.vocab,)
|
||||
|
||||
def __del__(self):
|
||||
self.mecab_tokenizer.__del__()
|
||||
|
||||
def __call__(self, text: str) -> Doc:
|
||||
dtokens = list(self.detailed_tokens(text))
|
||||
surfaces = [dt["surface"] for dt in dtokens]
|
||||
|
@ -47,7 +57,10 @@ class KoreanTokenizer(DummyTokenizer):
|
|||
for token, dtoken in zip(doc, dtokens):
|
||||
first_tag, sep, eomi_tags = dtoken["tag"].partition("+")
|
||||
token.tag_ = first_tag # stem(어간) or pre-final(선어말 어미)
|
||||
if token.tag_ in TAG_MAP:
|
||||
token.pos = TAG_MAP[token.tag_][POS]
|
||||
else:
|
||||
token.pos = X
|
||||
token.lemma_ = dtoken["lemma"]
|
||||
doc.user_data["full_tags"] = [dt["tag"] for dt in dtokens]
|
||||
return doc
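The hunk above guards the `TAG_MAP` lookup so that a tag missing from the map falls back to `X` rather than failing on the dictionary access. A minimal sketch of that fallback, assuming a trimmed two-entry `TAG_MAP` and string stand-ins for the `POS` and `X` symbols; `pos_for` is a hypothetical helper, not a function in spaCy:

```python
# String stand-ins for spacy.symbols.POS and X (integer IDs in spaCy).
POS, X = "POS", "X"

# Trimmed, illustrative tag map; the real one lives in spacy/lang/ko/tag_map.py.
TAG_MAP = {"NNG": {POS: "NOUN"}, "VV": {POS: "VERB"}}


def pos_for(full_tag):
    """Map the first part of a compound tag to a coarse POS, defaulting to X."""
    first_tag, _sep, _eomi_tags = full_tag.partition("+")
    if first_tag in TAG_MAP:
        return TAG_MAP[first_tag][POS]
    return X


results = [pos_for(t) for t in ["NNG", "VV+EC", "UNKNOWN_TAG"]]
```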
|
||||
|
@ -76,6 +89,7 @@ class KoreanDefaults(BaseDefaults):
|
|||
lex_attr_getters = LEX_ATTRS
|
||||
stop_words = STOP_WORDS
|
||||
writing_system = {"direction": "ltr", "has_case": False, "has_letters": False}
|
||||
infixes = TOKENIZER_INFIXES
|
||||
|
||||
|
||||
class Korean(Language):
|
||||
|
@ -90,7 +104,8 @@ def try_mecab_import() -> None:
|
|||
return MeCab
|
||||
except ImportError:
|
||||
raise ImportError(
|
||||
"Korean support requires [mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
|
||||
'The Korean tokenizer ("spacy.ko.KoreanTokenizer") requires '
|
||||
"[mecab-ko](https://bitbucket.org/eunjeon/mecab-ko/src/master/README.md), "
|
||||
"[mecab-ko-dic](https://bitbucket.org/eunjeon/mecab-ko-dic), "
|
||||
"and [natto-py](https://github.com/buruzaemon/natto-py)"
|
||||
) from None
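The `mecab_tokenizer` property in this file defers constructing the MeCab object until it is first accessed. The sketch below shows the same lazy-initialization pattern in isolation; `FakeMeCab` and `LazyTokenizer` are hypothetical names used only so the example runs without natto-py or mecab-ko installed.

```python
class FakeMeCab:
    """Hypothetical stand-in for the MeCab class; counts constructions."""

    instances = 0

    def __init__(self, options):
        FakeMeCab.instances += 1
        self.options = options


class LazyTokenizer:
    def __init__(self, backend_cls):
        # Store the class only; nothing is constructed yet.
        self._mecab = backend_cls
        self._mecab_tokenizer = None

    @property
    def mecab_tokenizer(self):
        # Build the backend on first access and cache it afterwards.
        if self._mecab_tokenizer is None:
            self._mecab_tokenizer = self._mecab("-F%f[0],%f[7]")
        return self._mecab_tokenizer


tok = LazyTokenizer(FakeMeCab)
created_before_access = FakeMeCab.instances
first = tok.mecab_tokenizer
second = tok.mecab_tokenizer
```

With this pattern, creating the tokenizer object costs nothing beyond storing the class, which matches the stated goal of letting `spacy init vectors ko` run for a pipeline that will use a different tokenizer in the end.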
|
||||
|
|
12
spacy/lang/ko/punctuation.py
Normal file
|
@ -0,0 +1,12 @@
|
|||
from ..char_classes import LIST_QUOTES
|
||||
from ..punctuation import TOKENIZER_INFIXES as BASE_TOKENIZER_INFIXES
|
||||
|
||||
|
||||
_infixes = (
|
||||
["·", "ㆍ", "\(", "\)"]
|
||||
+ [r"(?<=[0-9])~(?=[0-9-])"]
|
||||
+ LIST_QUOTES
|
||||
+ BASE_TOKENIZER_INFIXES
|
||||
)
|
||||
|
||||
TOKENIZER_INFIXES = _infixes
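The only regular expression among the new infixes is `(?<=[0-9])~(?=[0-9-])`, which matches a tilde only when a digit precedes it and a digit or minus sign follows it. The sketch below checks that pattern with Python's `re` module alone; `tilde_spans` is a hypothetical helper, and how spaCy compiles and applies the full infix list is not reproduced here.

```python
import re

# The range-tilde infix pattern, copied from the list above.
range_tilde = re.compile(r"(?<=[0-9])~(?=[0-9-])")


def tilde_spans(text):
    """Return the (start, end) span of every tilde the pattern accepts."""
    return [m.span() for m in range_tilde.finditer(text)]


spans = {t: tilde_spans(t) for t in ["10~20", "5~-3", "a~b", "~5", "5~"]}
```

The lookbehind and lookahead consume no characters, so each match is exactly one character wide: the tilde itself.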
|
|
@ -1,56 +1,219 @@
|
|||
from ...attrs import LIKE_NUM
|
||||
|
||||
|
||||
_num_words = [
|
||||
"ноль",
|
||||
"один",
|
||||
"два",
|
||||
"три",
|
||||
"четыре",
|
||||
"пять",
|
||||
"шесть",
|
||||
"семь",
|
||||
"восемь",
|
||||
"девять",
|
||||
"десять",
|
||||
"одиннадцать",
|
||||
"двенадцать",
|
||||
"тринадцать",
|
||||
"четырнадцать",
|
||||
"пятнадцать",
|
||||
"шестнадцать",
|
||||
"семнадцать",
|
||||
"восемнадцать",
|
||||
"девятнадцать",
|
||||
"двадцать",
|
||||
"тридцать",
|
||||
"сорок",
|
||||
"пятьдесят",
|
||||
"шестьдесят",
|
||||
"семьдесят",
|
||||
"восемьдесят",
|
||||
"девяносто",
|
||||
"сто",
|
||||
"двести",
|
||||
"триста",
|
||||
"четыреста",
|
||||
"пятьсот",
|
||||
"шестьсот",
|
||||
"семьсот",
|
||||
"восемьсот",
|
||||
"девятьсот",
|
||||
"тысяча",
|
||||
"миллион",
|
||||
"миллиард",
|
||||
"триллион",
|
||||
"квадриллион",
|
||||
"квинтиллион",
|
||||
]
|
||||
_num_words = list(
|
||||
set(
|
||||
"""
|
||||
ноль ноля нолю нолём ноле нулевой нулевого нулевому нулевым нулевом нулевая нулевую нулевое нулевые нулевых нулевыми
|
||||
|
||||
четверть четверти четвертью четвертей четвертям четвертями четвертях
|
||||
|
||||
треть трети третью третей третям третями третях
|
||||
|
||||
половина половины половине половину половиной половин половинам половинами половинах половиною
|
||||
|
||||
один одного одному одним одном
|
||||
первой первого первому первом первый первым первых
|
||||
во-первых
|
||||
единица единицы единице единицу единицей единиц единицам единицами единицах единицею
|
||||
|
||||
два двумя двум двух двоих двое две
|
||||
второго второму второй втором вторым вторых
|
||||
двойка двойки двойке двойку двойкой двоек двойкам двойками двойках двойкою
|
||||
во-вторых
|
||||
оба обе обеим обеими обеих обоим обоими обоих
|
||||
|
||||
полтора полторы полутора
|
||||
|
||||
три третьего третьему третьем третьим третий тремя трем трех трое троих трёх
|
||||
тройка тройки тройке тройку тройкою троек тройкам тройками тройках тройкой
|
||||
троечка троечки троечке троечку троечкой троечек троечкам троечками троечках троечкой
|
||||
трешка трешки трешке трешку трешкой трешек трешкам трешками трешках трешкою
|
||||
трёшка трёшки трёшке трёшку трёшкой трёшек трёшкам трёшками трёшках трёшкою
|
||||
трояк трояка трояку трояком трояке трояки трояков троякам трояками трояках
|
||||
треха треху трехой
|
||||
трёха трёху трёхой
|
||||
втроем втроём
|
||||
|
||||
четыре четвертого четвертому четвертом четвертый четвертым четверка четырьмя четырем четырех четверо четырёх четверым
|
||||
четверых
|
||||
вчетвером
|
||||
|
||||
пять пятого пятому пятом пятый пятым пятью пяти пятеро пятерых пятерыми
|
||||
впятером
|
||||
пятерочка пятерочки пятерочке пятерочками пятерочкой пятерочку пятерочкой пятерочками
|
||||
пятёрочка пятёрочки пятёрочке пятёрочками пятёрочкой пятёрочку пятёрочкой пятёрочками
|
||||
пятерка пятерки пятерке пятерками пятеркой пятерку пятерками
|
||||
пятёрка пятёрки пятёрке пятёрками пятёркой пятёрку пятёрками
|
||||
пятёра пятёры пятёре пятёрами пятёрой пятёру пятёрами
|
||||
пятера пятеры пятере пятерами пятерой пятеру пятерами
|
||||
пятак пятаки пятаке пятаками пятаком пятаку пятаками
|
||||
|
||||
шесть шестерка шестого шестому шестой шестом шестым шестью шести шестеро шестерых
|
||||
вшестером
|
||||
|
||||
семь семерка седьмого седьмому седьмой седьмом седьмым семью семи семеро седьмых
|
||||
всемером
|
||||
|
||||
восемь восьмерка восьмого восьмому восемью восьмой восьмом восьмым восеми восьмером восьми восьмью
|
||||
восьмерых
|
||||
ввосьмером
|
||||
|
||||
девять девятого девятому девятка девятом девятый девятым девятью девяти девятером вдевятером девятерых
|
||||
вдевятером
|
||||
|
||||
десять десятого десятому десятка десятом десятый десятым десятью десяти десятером десятых
|
||||
вдесятером
|
||||
|
||||
одиннадцать одиннадцатого одиннадцатому одиннадцатом одиннадцатый одиннадцатым одиннадцатью одиннадцати
|
||||
одиннадцатых
|
||||
|
||||
двенадцать двенадцатого двенадцатому двенадцатом двенадцатый двенадцатым двенадцатью двенадцати
|
||||
двенадцатых
|
||||
|
||||
тринадцать тринадцатого тринадцатому тринадцатом тринадцатый тринадцатым тринадцатью тринадцати
|
||||
тринадцатых
|
||||
|
||||
четырнадцать четырнадцатого четырнадцатому четырнадцатом четырнадцатый четырнадцатым четырнадцатью четырнадцати
|
||||
четырнадцатых
|
||||
|
||||
пятнадцать пятнадцатого пятнадцатому пятнадцатом пятнадцатый пятнадцатым пятнадцатью пятнадцати
|
||||
пятнадцатых
|
||||
пятнарик пятнарику пятнариком пятнарики
|
||||
|
||||
шестнадцать шестнадцатого шестнадцатому шестнадцатом шестнадцатый шестнадцатым шестнадцатью шестнадцати
|
||||
шестнадцатых
|
||||
|
||||
семнадцать семнадцатого семнадцатому семнадцатом семнадцатый семнадцатым семнадцатью семнадцати семнадцатых
|
||||
|
||||
восемнадцать восемнадцатого восемнадцатому восемнадцатом восемнадцатый восемнадцатым восемнадцатью восемнадцати
|
||||
восемнадцатых
|
||||
|
||||
девятнадцать девятнадцатого девятнадцатому девятнадцатом девятнадцатый девятнадцатым девятнадцатью девятнадцати
|
||||
девятнадцатых
|
||||
|
||||
двадцать двадцатого двадцатому двадцатом двадцатый двадцатым двадцатью двадцати двадцатых
|
||||
|
||||
четвертак четвертака четвертаке четвертаку четвертаки четвертаком четвертаками
|
||||
|
||||
тридцать тридцатого тридцатому тридцатом тридцатый тридцатым тридцатью тридцати тридцатых
|
||||
тридцадка тридцадку тридцадке тридцадки тридцадкой тридцадкою тридцадками
|
||||
|
||||
тридевять тридевяти тридевятью
|
||||
|
||||
сорок сорокового сороковому сороковом сороковым сороковой сороковых
|
||||
сорокет сорокета сорокету сорокете сорокеты сорокетом сорокетами сорокетам
|
||||
|
||||
пятьдесят пятьдесятого пятьдесятому пятьюдесятью пятьдесятом пятьдесятый пятьдесятым пятидесяти пятьдесятых
|
||||
полтинник полтинника полтиннике полтиннику полтинники полтинником полтинниками полтинникам полтинниках
|
||||
пятидесятка пятидесятке пятидесятку пятидесятки пятидесяткой пятидесятками пятидесяткам пятидесятках
|
||||
полтос полтоса полтосе полтосу полтосы полтосом полтосами полтосам полтосах
|
||||
|
||||
шестьдесят шестьдесятого шестьдесятому шестьюдесятью шестьдесятом шестьдесятый шестьдесятым шестидесятые шестидесяти
|
||||
шестьдесятых
|
||||
|
||||
семьдесят семьдесятого семьдесятому семьюдесятью семьдесятом семьдесятый семьдесятым семидесяти семьдесятых
|
||||
|
||||
восемьдесят восемьдесятого восемьдесятому восемьюдесятью восемьдесятом восемьдесятый восемьдесятым восемидесяти
|
||||
восьмидесяти восьмидесятых
|
||||
|
||||
девяносто девяностого девяностому девяностом девяностый девяностым девяноста девяностых
|
||||
|
||||
сто сотого сотому сотом сотен сотый сотым ста
|
||||
стольник стольника стольнику стольнике стольники стольником стольниками
|
||||
сотка сотки сотке соткой сотками соткам сотках
|
||||
сотня сотни сотне сотней сотнями сотням сотнях
|
||||
|
||||
двести двумястами двухсотого двухсотому двухсотом двухсотый двухсотым двумстам двухстах двухсот
|
||||
|
||||
триста тремястами трехсотого трехсотому трехсотом трехсотый трехсотым тремстам трехстах трехсот
|
||||
|
||||
четыреста четырехсотого четырехсотому четырьмястами четырехсотом четырехсотый четырехсотым четыремстам четырехстах
|
||||
четырехсот
|
||||
|
||||
пятьсот пятисотого пятисотому пятьюстами пятисотом пятисотый пятисотым пятистам пятистах пятисот
|
||||
пятисотка пятисотки пятисотке пятисоткой пятисотками пятисоткам пятисоткою пятисотках
|
||||
пятихатка пятихатки пятихатке пятихаткой пятихатками пятихаткам пятихаткою пятихатках
|
||||
пятифан пятифаны пятифане пятифаном пятифанами пятифанах
|
||||
|
||||
шестьсот шестисотого шестисотому шестьюстами шестисотом шестисотый шестисотым шестистам шестистах шестисот
|
||||
|
||||
семьсот семисотого семисотому семьюстами семисотом семисотый семисотым семистам семистах семисот
|
||||
|
||||
восемьсот восемисотого восемисотому восемисотом восемисотый восемисотым восьмистами восьмистам восьмистах восьмисот
|
||||
|
||||
девятьсот девятисотого девятисотому девятьюстами девятисотом девятисотый девятисотым девятистам девятистах девятисот
|
||||
|
||||
тысяча тысячного тысячному тысячном тысячный тысячным тысячам тысячах тысячей тысяч тысячи тыс
|
||||
косарь косаря косару косарем косарями косарях косарям косарей
|
||||
|
||||
десятитысячный десятитысячного десятитысячному десятитысячным десятитысячном десятитысячная десятитысячной
|
||||
десятитысячную десятитысячною десятитысячное десятитысячные десятитысячных десятитысячными
|
||||
|
||||
двадцатитысячный двадцатитысячного двадцатитысячному двадцатитысячным двадцатитысячном двадцатитысячная
|
||||
двадцатитысячной двадцатитысячную двадцатитысячною двадцатитысячное двадцатитысячные двадцатитысячных
|
||||
двадцатитысячными
|
||||
|
||||
тридцатитысячный тридцатитысячного тридцатитысячному тридцатитысячным тридцатитысячном тридцатитысячная
|
||||
тридцатитысячной тридцатитысячную тридцатитысячною тридцатитысячное тридцатитысячные тридцатитысячных
|
||||
тридцатитысячными
|
||||
|
||||
сорокатысячный сорокатысячного сорокатысячному сорокатысячным сорокатысячном сорокатысячная
|
||||
сорокатысячной сорокатысячную сорокатысячною сорокатысячное сорокатысячные сорокатысячных
|
||||
сорокатысячными
|
||||
|
||||
пятидесятитысячный пятидесятитысячного пятидесятитысячному пятидесятитысячным пятидесятитысячном пятидесятитысячная
|
||||
пятидесятитысячной пятидесятитысячную пятидесятитысячною пятидесятитысячное пятидесятитысячные пятидесятитысячных
|
||||
пятидесятитысячными
|
||||
|
||||
шестидесятитысячный шестидесятитысячного шестидесятитысячному шестидесятитысячным шестидесятитысячном шестидесятитысячная
|
||||
шестидесятитысячной шестидесятитысячную шестидесятитысячною шестидесятитысячное шестидесятитысячные шестидесятитысячных
|
||||
шестидесятитысячными
|
||||
|
||||
семидесятитысячный семидесятитысячного семидесятитысячному семидесятитысячным семидесятитысячном семидесятитысячная
|
||||
семидесятитысячной семидесятитысячную семидесятитысячною семидесятитысячное семидесятитысячные семидесятитысячных
|
||||
семидесятитысячными
|
||||
|
||||
восьмидесятитысячный восьмидесятитысячного восьмидесятитысячному восьмидесятитысячным восьмидесятитысячном восьмидесятитысячная
|
||||
восьмидесятитысячной восьмидесятитысячную восьмидесятитысячною восьмидесятитысячное восьмидесятитысячные восьмидесятитысячных
|
||||
восьмидесятитысячными
|
||||
|
||||
стотысячный стотысячного стотысячному стотысячным стотысячном стотысячная стотысячной стотысячную стотысячное
|
||||
стотысячные стотысячных стотысячными стотысячною
|
||||
|
||||
миллион миллионного миллионов миллионному миллионном миллионный миллионным миллионом миллиона миллионе миллиону
|
||||
миллионов
|
||||
лям ляма лямы лямом лямами лямах лямов
|
||||
млн
|
||||
|
||||
десятимиллионная десятимиллионной десятимиллионными десятимиллионный десятимиллионным десятимиллионному
|
||||
десятимиллионными десятимиллионную десятимиллионное десятимиллионные десятимиллионных десятимиллионною
|
||||
|
||||
миллиард миллиардного миллиардному миллиардном миллиардный миллиардным миллиардом миллиарда миллиарде миллиарду
|
||||
миллиардов
|
||||
лярд лярда лярды лярдом лярдами лярдах лярдов
|
||||
млрд
|
||||
|
||||
триллион триллионного триллионному триллионном триллионный триллионным триллионом триллиона триллионе триллиону
|
||||
триллионов трлн
|
||||
|
||||
квадриллион квадриллионного квадриллионному квадриллионный квадриллионным квадриллионом квадриллиона квадриллионе
|
||||
квадриллиону квадриллионов квадрлн
|
||||
|
||||
квинтиллион квинтиллионного квинтиллионному квинтиллионный квинтиллионным квинтиллионом квинтиллиона квинтиллионе
|
||||
квинтиллиону квинтиллионов квинтлн
|
||||
|
||||
i ii iii iv v vi vii viii ix x xi xii xiii xiv xv xvi xvii xviii xix xx xxi xxii xxiii xxiv xxv xxvi xxvii xxviii xxix
|
||||
""".split()
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def like_num(text):
|
||||
if text.startswith(("+", "-", "±", "~")):
|
||||
text = text[1:]
|
||||
if text.endswith("%"):
|
||||
text = text[:-1]
|
||||
text = text.replace(",", "").replace(".", "")
|
||||
if text.isdigit():
|
||||
return True
|
||||
|
|
|
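Reviewer note: the `like_num` hunk above is cut off after the digit check. The steps that are visible can be sketched as below. This is a minimal, standalone sketch, not the file's actual body: `_NUM_WORDS` is a tiny illustrative subset of the word list in the diff, and the final lowercase word lookup is an assumption about the part of the function the hunk does not show.

```python
# Illustrative subset only; the list in the diff is far longer.
_NUM_WORDS = {"миллион", "млн", "миллиард", "млрд", "xiv"}


def like_num(text: str) -> bool:
    # Drop one leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Drop a trailing percent sign.
    if text.endswith("%"):
        text = text[:-1]
    # Ignore thousands and decimal separators before the digit check.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Assumption: fall back to a lowercase word lookup (not shown in the hunk).
    return text.lower() in _NUM_WORDS
```

Because separators are removed before the digit check, a token such as `-1,000.5%` is accepted, while an ordinary word that is not in the list is rejected.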
@ -1,52 +1,111 @@
|
|||
STOP_WORDS = set(
|
||||
"""
|
||||
а
|
||||
а авось ага агу аж ай али алло ау ах ая
|
||||
|
||||
будем будет будете будешь буду будут будучи будь будьте бы был была были было
|
||||
быть
|
||||
б будем будет будете будешь буду будут будучи будь будьте бы был была были было
|
||||
быть бац без безусловно бишь благо благодаря ближайшие близко более больше
|
||||
будто бывает бывала бывали бываю бывают бытует
|
||||
|
||||
в вам вами вас весь во вот все всё всего всей всем всём всеми всему всех всею
|
||||
всея всю вся вы
|
||||
всея всю вся вы ваш ваша ваше ваши вдали вдобавок вдруг ведь везде вернее
|
||||
взаимно взаправду видно вишь включая вместо внакладе вначале вне вниз внизу
|
||||
вновь вовсе возможно воистину вокруг вон вообще вопреки вперекор вплоть
|
||||
вполне вправду вправе впрочем впрямь вресноту вроде вряд всегда всюду
|
||||
всякий всякого всякой всячески вчеред
|
||||
|
||||
да для до
|
||||
г го где гораздо гав
|
||||
|
||||
его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею
|
||||
д да для до дабы давайте давно давным даже далее далеко дальше данная
|
||||
данного данное данной данном данному данные данный данных дану данунах
|
||||
даром де действительно довольно доколе доколь долго должен должна
|
||||
должно должны должный дополнительно другая другие другим другими
|
||||
других другое другой
|
||||
|
||||
же
|
||||
е его едим едят ее её ей ел ела ем ему емъ если ест есть ешь еще ещё ею едва
|
||||
ежели еле
|
||||
|
||||
за
|
||||
ж же
|
||||
|
||||
и из или им ими имъ их
|
||||
з за затем зато зачем здесь значит зря
|
||||
|
||||
и из или им ими имъ их ибо иль имеет имел имела имело именно иметь иначе
|
||||
иногда иным иными итак ишь
|
||||
|
||||
й
|
||||
|
||||
к как кем ко когда кого ком кому комья которая которого которое которой котором
|
||||
которому которою которую которые который которым которыми которых кто
|
||||
которому которою которую которые который которым которыми которых кто ка кабы
|
||||
каждая каждое каждые каждый кажется казалась казались казалось казался казаться
|
||||
какая какие каким какими каков какого какой какому какою касательно кой коли
|
||||
коль конечно короче кроме кстати ку куда
|
||||
|
||||
меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего
|
||||
л ли либо лишь любая любого любое любой любом любую любыми любых
|
||||
|
||||
м меня мне мной мною мог моги могите могла могли могло могу могут мое моё моего
|
||||
моей моем моём моему моею можем может можете можешь мои мой моим моими моих
|
||||
мочь мою моя мы
|
||||
мочь мою моя мы мало меж между менее меньше мимо многие много многого многое
|
||||
многом многому можно мол му
|
||||
|
||||
на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим
|
||||
н на нам нами нас наса наш наша наше нашего нашей нашем нашему нашею наши нашим
|
||||
нашими наших нашу не него нее неё ней нем нём нему нет нею ним ними них но
|
||||
наверняка наверху навряд навыворот над надо назад наиболее наизворот
|
||||
наизнанку наипаче накануне наконец наоборот наперед наперекор наподобие
|
||||
например напротив напрямую насилу настоящая настоящее настоящие настоящий
|
||||
насчет нате находиться начала начале неважно негде недавно недалеко незачем
|
||||
некем некогда некому некоторая некоторые некоторый некоторых некто некуда
|
||||
нельзя немногие немногим немного необходимо необходимости необходимые
|
||||
необходимым неоткуда непрерывно нередко несколько нету неужели нечего
|
||||
нечем нечему нечто нешто нибудь нигде ниже низко никак никакой никем
|
||||
никогда никого никому никто никуда ниоткуда нипочем ничего ничем ничему
|
||||
ничто ну нужная нужно нужного нужные нужный нужных ныне нынешнее нынешней
|
||||
нынешних нынче
|
||||
|
||||
о об один одна одни одним одними одних одно одного одной одном одному одною
|
||||
одну он она оне они оно от
|
||||
одну он она оне они оно от оба общую обычно ого однажды однако ой около оный
|
||||
оп опять особенно особо особую особые откуда отнелижа отнелиже отовсюду
|
||||
отсюда оттого оттот оттуда отчего отчему ох очевидно очень ом
|
||||
|
||||
по при
|
||||
п по при паче перед под подавно поди подобная подобно подобного подобные
|
||||
подобный подобным подобных поелику пожалуй пожалуйста позже поистине
|
||||
пока покамест поколе поколь покуда покудова помимо понеже поприще пор
|
||||
пора посему поскольку после посреди посредством потом потому потомушта
|
||||
похожем почему почти поэтому прежде притом причем про просто прочего
|
||||
прочее прочему прочими проще прям пусть
|
||||
|
||||
р ради разве ранее рано раньше рядом
|
||||
|
||||
с сам сама сами самим самими самих само самого самом самому саму свое своё
|
||||
своего своей своем своём своему своею свои свой своим своими своих свою своя
|
||||
себе себя собой собою
|
||||
себе себя собой собою самая самое самой самый самых сверх свыше се сего сей
|
||||
сейчас сие сих сквозь сколько скорее скоро следует слишком смогут сможет
|
||||
сначала снова со собственно совсем сперва спокону спустя сразу среди сродни
|
||||
стал стала стали стало стать суть сызнова
|
||||
|
||||
та так такая такие таким такими таких такого такое такой таком такому такою
|
||||
такую те тебе тебя тем теми тех то тобой тобою того той только том томах тому
|
||||
тот тою ту ты
|
||||
та то ту ты ти так такая такие таким такими таких такого такое такой таком такому такою
|
||||
такую те тебе тебя тем теми тех тобой тобою того той только том томах тому
|
||||
тот тою также таки таков такова там твои твоим твоих твой твоя твоё
|
||||
теперь тогда тоже тотчас точно туда тут тьфу тая
|
||||
|
||||
у уже
|
||||
у уже увы уж ура ух ую
|
||||
|
||||
чего чем чём чему что чтобы
|
||||
ф фу
|
||||
|
||||
эта эти этим этими этих это этого этой этом этому этот этою эту
|
||||
х ха хе хорошо хотел хотела хотелось хотеть хоть хотя хочешь хочу хуже
|
||||
|
||||
я
|
||||
ч чего чем чём чему что чтобы часто чаще чей через чтоб чуть чхать чьим
|
||||
чьих чьё чё
|
||||
|
||||
ш ша
|
||||
|
||||
щ ща щас
|
||||
|
||||
ы ых ые ый
|
||||
|
||||
э эта эти этим этими этих это этого этой этом этому этот этою эту эдак эдакий
|
||||
эй эка экий этак этакий эх
|
||||
|
||||
ю
|
||||
|
||||
я явно явных яко якобы якоже
|
||||
""".split()
|
||||
)
|
||||
|
|
|
@ -2,7 +2,6 @@ from ..tokenizer_exceptions import BASE_EXCEPTIONS
|
|||
from ...symbols import ORTH, NORM
|
||||
from ...util import update_exc
|
||||
|
||||
|
||||
_exc = {}
|
||||
|
||||
_abbrev_exc = [
|
||||
|
@ -42,7 +41,6 @@ _abbrev_exc = [
|
|||
{ORTH: "дек", NORM: "декабрь"},
|
||||
]
|
||||
|
||||
|
||||
for abbrev_desc in _abbrev_exc:
|
||||
abbrev = abbrev_desc[ORTH]
|
||||
for orth in (abbrev, abbrev.capitalize(), abbrev.upper()):
|
||||
|
@ -50,17 +48,354 @@ for abbrev_desc in _abbrev_exc:
|
|||
_exc[orth + "."] = [{ORTH: orth + ".", NORM: abbrev_desc[NORM]}]
|
||||
|
||||
|
||||
_slang_exc = [
|
||||
for abbr in [
|
||||
# Year slang abbreviations
|
||||
{ORTH: "2к15", NORM: "2015"},
|
||||
{ORTH: "2к16", NORM: "2016"},
|
||||
{ORTH: "2к17", NORM: "2017"},
|
||||
{ORTH: "2к18", NORM: "2018"},
|
||||
{ORTH: "2к19", NORM: "2019"},
|
||||
{ORTH: "2к20", NORM: "2020"},
|
||||
]
|
||||
{ORTH: "2к21", NORM: "2021"},
|
||||
{ORTH: "2к22", NORM: "2022"},
|
||||
{ORTH: "2к23", NORM: "2023"},
|
||||
{ORTH: "2к24", NORM: "2024"},
|
||||
{ORTH: "2к25", NORM: "2025"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
for slang_desc in _slang_exc:
|
||||
_exc[slang_desc[ORTH]] = [slang_desc]
|
||||
for abbr in [
|
||||
# Professional and academic title abbreviations
|
||||
{ORTH: "ак.", NORM: "академик"},
|
||||
{ORTH: "акад.", NORM: "академик"},
|
||||
{ORTH: "д-р архитектуры", NORM: "доктор архитектуры"},
|
||||
{ORTH: "д-р биол. наук", NORM: "доктор биологических наук"},
|
||||
{ORTH: "д-р ветеринар. наук", NORM: "доктор ветеринарных наук"},
|
||||
{ORTH: "д-р воен. наук", NORM: "доктор военных наук"},
|
||||
{ORTH: "д-р геогр. наук", NORM: "доктор географических наук"},
|
||||
{ORTH: "д-р геол.-минерал. наук", NORM: "доктор геолого-минералогических наук"},
|
||||
{ORTH: "д-р искусствоведения", NORM: "доктор искусствоведения"},
|
||||
{ORTH: "д-р ист. наук", NORM: "доктор исторических наук"},
|
||||
{ORTH: "д-р культурологии", NORM: "доктор культурологии"},
|
||||
{ORTH: "д-р мед. наук", NORM: "доктор медицинских наук"},
|
||||
{ORTH: "д-р пед. наук", NORM: "доктор педагогических наук"},
|
||||
{ORTH: "д-р полит. наук", NORM: "доктор политических наук"},
|
||||
{ORTH: "д-р психол. наук", NORM: "доктор психологических наук"},
|
||||
{ORTH: "д-р с.-х. наук", NORM: "доктор сельскохозяйственных наук"},
|
||||
{ORTH: "д-р социол. наук", NORM: "доктор социологических наук"},
|
||||
{ORTH: "д-р техн. наук", NORM: "доктор технических наук"},
|
||||
{ORTH: "д-р фармацевт. наук", NORM: "доктор фармацевтических наук"},
|
||||
{ORTH: "д-р физ.-мат. наук", NORM: "доктор физико-математических наук"},
|
||||
{ORTH: "д-р филол. наук", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д-р филос. наук", NORM: "доктор философских наук"},
|
||||
{ORTH: "д-р хим. наук", NORM: "доктор химических наук"},
|
||||
{ORTH: "д-р экон. наук", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д-р юрид. наук", NORM: "доктор юридических наук"},
|
||||
{ORTH: "д-р", NORM: "доктор"},
|
||||
{ORTH: "д.б.н.", NORM: "доктор биологических наук"},
|
||||
{ORTH: "д.г.-м.н.", NORM: "доктор геолого-минералогических наук"},
|
||||
{ORTH: "д.г.н.", NORM: "доктор географических наук"},
|
||||
{ORTH: "д.и.н.", NORM: "доктор исторических наук"},
|
||||
{ORTH: "д.иск.", NORM: "доктор искусствоведения"},
|
||||
{ORTH: "д.м.н.", NORM: "доктор медицинских наук"},
|
||||
{ORTH: "д.п.н.", NORM: "доктор психологических наук"},
|
||||
{ORTH: "д.пед.н.", NORM: "доктор педагогических наук"},
|
||||
{ORTH: "д.полит.н.", NORM: "доктор политических наук"},
|
||||
{ORTH: "д.с.-х.н.", NORM: "доктор сельскохозяйственных наук"},
|
||||
{ORTH: "д.социол.н.", NORM: "доктор социологических наук"},
|
||||
{ORTH: "д.т.н.", NORM: "доктор технических наук"},
|
||||
{ORTH: "д.т.н", NORM: "доктор технических наук"},
|
||||
{ORTH: "д.ф.-м.н.", NORM: "доктор физико-математических наук"},
|
||||
{ORTH: "д.ф.н.", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д.филос.н.", NORM: "доктор философских наук"},
|
||||
{ORTH: "д.фил.н.", NORM: "доктор филологических наук"},
|
||||
{ORTH: "д.х.н.", NORM: "доктор химических наук"},
|
||||
{ORTH: "д.э.н.", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д.э.н", NORM: "доктор экономических наук"},
|
||||
{ORTH: "д.ю.н.", NORM: "доктор юридических наук"},
|
||||
{ORTH: "доц.", NORM: "доцент"},
|
||||
{ORTH: "и.о.", NORM: "исполняющий обязанности"},
|
||||
{ORTH: "к.б.н.", NORM: "кандидат биологических наук"},
|
||||
{ORTH: "к.воен.н.", NORM: "кандидат военных наук"},
|
||||
{ORTH: "к.г.-м.н.", NORM: "кандидат геолого-минералогических наук"},
|
||||
{ORTH: "к.г.н.", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.геогр.н", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.геогр.наук", NORM: "кандидат географических наук"},
|
||||
{ORTH: "к.и.н.", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "к.иск.", NORM: "кандидат искусствоведения"},
|
||||
{ORTH: "к.м.н.", NORM: "кандидат медицинских наук"},
|
||||
{ORTH: "к.п.н.", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "к.псх.н.", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "к.пед.н.", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "канд.пед.наук", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "к.полит.н.", NORM: "кандидат политических наук"},
|
||||
{ORTH: "к.с.-х.н.", NORM: "кандидат сельскохозяйственных наук"},
|
||||
{ORTH: "к.социол.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.с.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.т.н.", NORM: "кандидат технических наук"},
|
||||
{ORTH: "к.ф.-м.н.", NORM: "кандидат физико-математических наук"},
|
||||
{ORTH: "к.ф.н.", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.фил.н.", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.филол.н", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "к.фарм.наук", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.фарм.н.", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.фарм.н", NORM: "кандидат фармакологических наук"},
|
||||
{ORTH: "к.филос.наук", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.филос.н.", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.филос.н", NORM: "кандидат философских наук"},
|
||||
{ORTH: "к.х.н.", NORM: "кандидат химических наук"},
|
||||
{ORTH: "к.х.н", NORM: "кандидат химических наук"},
|
||||
{ORTH: "к.э.н.", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "к.э.н", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "к.ю.н.", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "к.ю.н", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "канд. архитектуры", NORM: "кандидат архитектуры"},
|
||||
{ORTH: "канд. биол. наук", NORM: "кандидат биологических наук"},
|
||||
{ORTH: "канд. ветеринар. наук", NORM: "кандидат ветеринарных наук"},
|
||||
{ORTH: "канд. воен. наук", NORM: "кандидат военных наук"},
|
||||
{ORTH: "канд. геогр. наук", NORM: "кандидат географических наук"},
|
||||
{ORTH: "канд. геол.-минерал. наук", NORM: "кандидат геолого-минералогических наук"},
|
||||
{ORTH: "канд. искусствоведения", NORM: "кандидат искусствоведения"},
|
||||
{ORTH: "канд. ист. наук", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "к.ист.н.", NORM: "кандидат исторических наук"},
|
||||
{ORTH: "канд. культурологии", NORM: "кандидат культурологии"},
|
||||
{ORTH: "канд. мед. наук", NORM: "кандидат медицинских наук"},
|
||||
{ORTH: "канд. пед. наук", NORM: "кандидат педагогических наук"},
|
||||
{ORTH: "канд. полит. наук", NORM: "кандидат политических наук"},
|
||||
{ORTH: "канд. психол. наук", NORM: "кандидат психологических наук"},
|
||||
{ORTH: "канд. с.-х. наук", NORM: "кандидат сельскохозяйственных наук"},
|
||||
{ORTH: "канд. социол. наук", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.наук", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.н.", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "к.соц.н", NORM: "кандидат социологических наук"},
|
||||
{ORTH: "канд. техн. наук", NORM: "кандидат технических наук"},
|
||||
{ORTH: "канд. фармацевт. наук", NORM: "кандидат фармацевтических наук"},
|
||||
{ORTH: "канд. физ.-мат. наук", NORM: "кандидат физико-математических наук"},
|
||||
{ORTH: "канд. филол. наук", NORM: "кандидат филологических наук"},
|
||||
{ORTH: "канд. филос. наук", NORM: "кандидат философских наук"},
|
||||
{ORTH: "канд. хим. наук", NORM: "кандидат химических наук"},
|
||||
{ORTH: "канд. экон. наук", NORM: "кандидат экономических наук"},
|
||||
{ORTH: "канд. юрид. наук", NORM: "кандидат юридических наук"},
|
||||
{ORTH: "в.н.с.", NORM: "ведущий научный сотрудник"},
|
||||
{ORTH: "мл. науч. сотр.", NORM: "младший научный сотрудник"},
|
||||
{ORTH: "м.н.с.", NORM: "младший научный сотрудник"},
|
||||
{ORTH: "проф.", NORM: "профессор"},
|
||||
{ORTH: "профессор.кафедры", NORM: "профессор кафедры"},
|
||||
{ORTH: "ст. науч. сотр.", NORM: "старший научный сотрудник"},
|
||||
{ORTH: "чл.-к.", NORM: "член-корреспондент"},
|
||||
{ORTH: "чл.-корр.", NORM: "член-корреспондент"},
|
||||
{ORTH: "чл.-кор.", NORM: "член-корреспондент"},
|
||||
{ORTH: "дир.", NORM: "директор"},
|
||||
{ORTH: "зам. дир.", NORM: "заместитель директора"},
|
||||
{ORTH: "зав. каф.", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "зав.кафедрой", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "зав. кафедрой", NORM: "заведующий кафедрой"},
|
||||
{ORTH: "асп.", NORM: "аспирант"},
|
||||
{ORTH: "гл. науч. сотр.", NORM: "главный научный сотрудник"},
|
||||
{ORTH: "вед. науч. сотр.", NORM: "ведущий научный сотрудник"},
|
||||
{ORTH: "науч. сотр.", NORM: "научный сотрудник"},
|
||||
{ORTH: "к.м.с.", NORM: "кандидат в мастера спорта"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Literary phrase abbreviations
|
||||
{ORTH: "и т.д.", NORM: "и так далее"},
|
||||
{ORTH: "и т.п.", NORM: "и тому подобное"},
|
||||
{ORTH: "т.д.", NORM: "так далее"},
|
||||
{ORTH: "т.п.", NORM: "тому подобное"},
|
||||
{ORTH: "т.е.", NORM: "то есть"},
|
||||
{ORTH: "т.к.", NORM: "так как"},
|
||||
{ORTH: "в т.ч.", NORM: "в том числе"},
|
||||
{ORTH: "и пр.", NORM: "и прочие"},
|
||||
{ORTH: "и др.", NORM: "и другие"},
|
||||
{ORTH: "т.н.", NORM: "так называемый"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Abbreviations for forms of address
|
||||
{ORTH: "г-н", NORM: "господин"},
|
||||
{ORTH: "г-да", NORM: "господа"},
|
||||
{ORTH: "г-жа", NORM: "госпожа"},
|
||||
{ORTH: "тов.", NORM: "товарищ"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Time period abbreviations
|
||||
{ORTH: "до н.э.", NORM: "до нашей эры"},
|
||||
{ORTH: "по н.в.", NORM: "по настоящее время"},
|
||||
{ORTH: "в н.в.", NORM: "в настоящее время"},
|
||||
{ORTH: "наст.", NORM: "настоящий"},
|
||||
{ORTH: "наст. время", NORM: "настоящее время"},
|
||||
{ORTH: "г.г.", NORM: "годы"},
|
||||
{ORTH: "гг.", NORM: "годы"},
|
||||
{ORTH: "т.г.", NORM: "текущий год"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Address element abbreviations
|
||||
{ORTH: "респ.", NORM: "республика"},
|
||||
{ORTH: "обл.", NORM: "область"},
|
||||
{ORTH: "г.ф.з.", NORM: "город федерального значения"},
|
||||
{ORTH: "а.обл.", NORM: "автономная область"},
|
||||
{ORTH: "а.окр.", NORM: "автономный округ"},
|
||||
{ORTH: "м.р-н", NORM: "муниципальный район"},
|
||||
{ORTH: "г.о.", NORM: "городской округ"},
|
||||
{ORTH: "г.п.", NORM: "городское поселение"},
|
||||
{ORTH: "с.п.", NORM: "сельское поселение"},
|
||||
{ORTH: "вн.р-н", NORM: "внутригородской район"},
|
||||
{ORTH: "вн.тер.г.", NORM: "внутригородская территория города"},
|
||||
{ORTH: "пос.", NORM: "поселение"},
|
||||
{ORTH: "р-н", NORM: "район"},
|
||||
{ORTH: "с/с", NORM: "сельсовет"},
|
||||
{ORTH: "г.", NORM: "город"},
|
||||
{ORTH: "п.г.т.", NORM: "поселок городского типа"},
|
||||
{ORTH: "пгт.", NORM: "поселок городского типа"},
|
||||
{ORTH: "р.п.", NORM: "рабочий поселок"},
|
||||
{ORTH: "рп.", NORM: "рабочий поселок"},
|
||||
{ORTH: "кп.", NORM: "курортный поселок"},
|
||||
{ORTH: "гп.", NORM: "городской поселок"},
|
||||
{ORTH: "п.", NORM: "поселок"},
|
||||
{ORTH: "в-ки", NORM: "выселки"},
|
||||
{ORTH: "г-к", NORM: "городок"},
|
||||
{ORTH: "з-ка", NORM: "заимка"},
|
||||
{ORTH: "п-к", NORM: "починок"},
|
||||
{ORTH: "киш.", NORM: "кишлак"},
|
||||
{ORTH: "п. ст. ", NORM: "поселок станция"},
|
||||
{ORTH: "п. ж/д ст. ", NORM: "поселок при железнодорожной станции"},
|
||||
{ORTH: "ж/д бл-ст", NORM: "железнодорожный блокпост"},
|
||||
{ORTH: "ж/д б-ка", NORM: "железнодорожная будка"},
|
||||
{ORTH: "ж/д в-ка", NORM: "железнодорожная ветка"},
|
||||
{ORTH: "ж/д к-ма", NORM: "железнодорожная казарма"},
|
||||
{ORTH: "ж/д к-т", NORM: "железнодорожный комбинат"},
|
||||
{ORTH: "ж/д пл-ма", NORM: "железнодорожная платформа"},
|
||||
{ORTH: "ж/д пл-ка", NORM: "железнодорожная площадка"},
|
||||
{ORTH: "ж/д п.п.", NORM: "железнодорожный путевой пост"},
|
||||
{ORTH: "ж/д о.п.", NORM: "железнодорожный остановочный пункт"},
|
||||
{ORTH: "ж/д рзд.", NORM: "железнодорожный разъезд"},
|
||||
{ORTH: "ж/д ст. ", NORM: "железнодорожная станция"},
|
||||
{ORTH: "м-ко", NORM: "местечко"},
|
||||
{ORTH: "д.", NORM: "деревня"},
|
||||
{ORTH: "с.", NORM: "село"},
|
||||
{ORTH: "сл.", NORM: "слобода"},
|
||||
{ORTH: "ст. ", NORM: "станция"},
|
||||
{ORTH: "ст-ца", NORM: "станица"},
|
||||
{ORTH: "у.", NORM: "улус"},
|
||||
{ORTH: "х.", NORM: "хутор"},
|
||||
{ORTH: "рзд.", NORM: "разъезд"},
|
||||
{ORTH: "зим.", NORM: "зимовье"},
|
||||
{ORTH: "б-г", NORM: "берег"},
|
||||
{ORTH: "ж/р", NORM: "жилой район"},
|
||||
{ORTH: "кв-л", NORM: "квартал"},
|
||||
{ORTH: "мкр.", NORM: "микрорайон"},
|
||||
{ORTH: "ост-в", NORM: "остров"},
|
||||
{ORTH: "платф.", NORM: "платформа"},
|
||||
{ORTH: "п/р", NORM: "промышленный район"},
|
||||
{ORTH: "р-н", NORM: "район"},
|
||||
{ORTH: "тер.", NORM: "территория"},
|
||||
{
|
||||
ORTH: "тер. СНО",
|
||||
NORM: "территория садоводческих некоммерческих объединений граждан",
|
||||
},
|
||||
{
|
||||
ORTH: "тер. ОНО",
|
||||
NORM: "территория огороднических некоммерческих объединений граждан",
|
||||
},
|
||||
{ORTH: "тер. ДНО", NORM: "территория дачных некоммерческих объединений граждан"},
|
||||
{ORTH: "тер. СНТ", NORM: "территория садоводческих некоммерческих товариществ"},
|
||||
{ORTH: "тер. ОНТ", NORM: "территория огороднических некоммерческих товариществ"},
|
||||
{ORTH: "тер. ДНТ", NORM: "территория дачных некоммерческих товариществ"},
|
||||
{ORTH: "тер. СПК", NORM: "территория садоводческих потребительских кооперативов"},
|
||||
{ORTH: "тер. ОПК", NORM: "территория огороднических потребительских кооперативов"},
|
||||
{ORTH: "тер. ДПК", NORM: "территория дачных потребительских кооперативов"},
|
||||
{ORTH: "тер. СНП", NORM: "территория садоводческих некоммерческих партнерств"},
|
||||
{ORTH: "тер. ОНП", NORM: "территория огороднических некоммерческих партнерств"},
|
||||
{ORTH: "тер. ДНП", NORM: "территория дачных некоммерческих партнерств"},
|
||||
{ORTH: "тер. ТСН", NORM: "территория товарищества собственников недвижимости"},
|
||||
{ORTH: "тер. ГСК", NORM: "территория гаражно-строительного кооператива"},
|
||||
{ORTH: "ус.", NORM: "усадьба"},
|
||||
{ORTH: "тер.ф.х.", NORM: "территория фермерского хозяйства"},
|
||||
{ORTH: "ю.", NORM: "юрты"},
|
||||
{ORTH: "ал.", NORM: "аллея"},
|
||||
{ORTH: "б-р", NORM: "бульвар"},
|
||||
{ORTH: "взв.", NORM: "взвоз"},
|
||||
{ORTH: "взд.", NORM: "въезд"},
|
||||
{ORTH: "дор.", NORM: "дорога"},
|
||||
{ORTH: "ззд.", NORM: "заезд"},
|
||||
{ORTH: "км", NORM: "километр"},
|
||||
{ORTH: "к-цо", NORM: "кольцо"},
|
||||
{ORTH: "лн.", NORM: "линия"},
|
||||
{ORTH: "мгстр.", NORM: "магистраль"},
|
||||
{ORTH: "наб.", NORM: "набережная"},
|
||||
{ORTH: "пер-д", NORM: "переезд"},
|
||||
{ORTH: "пер.", NORM: "переулок"},
|
||||
{ORTH: "пл-ка", NORM: "площадка"},
|
||||
{ORTH: "пл.", NORM: "площадь"},
|
||||
{ORTH: "пр-д", NORM: "проезд"},
|
||||
{ORTH: "пр-к", NORM: "просек"},
|
||||
{ORTH: "пр-ка", NORM: "просека"},
|
||||
{ORTH: "пр-лок", NORM: "проселок"},
|
||||
{ORTH: "пр-кт", NORM: "проспект"},
|
||||
{ORTH: "проул.", NORM: "проулок"},
|
||||
{ORTH: "рзд.", NORM: "разъезд"},
|
||||
{ORTH: "ряд", NORM: "ряд(ы)"},
|
||||
{ORTH: "с-р", NORM: "сквер"},
|
||||
{ORTH: "с-к", NORM: "спуск"},
|
||||
{ORTH: "сзд.", NORM: "съезд"},
|
||||
{ORTH: "туп.", NORM: "тупик"},
|
||||
{ORTH: "ул.", NORM: "улица"},
|
||||
{ORTH: "ш.", NORM: "шоссе"},
|
||||
{ORTH: "влд.", NORM: "владение"},
|
||||
{ORTH: "г-ж", NORM: "гараж"},
|
||||
{ORTH: "д.", NORM: "дом"},
|
||||
{ORTH: "двлд.", NORM: "домовладение"},
|
||||
{ORTH: "зд.", NORM: "здание"},
|
||||
{ORTH: "з/у", NORM: "земельный участок"},
|
||||
{ORTH: "кв.", NORM: "квартира"},
|
||||
{ORTH: "ком.", NORM: "комната"},
|
||||
{ORTH: "подв.", NORM: "подвал"},
|
||||
{ORTH: "кот.", NORM: "котельная"},
|
||||
{ORTH: "п-б", NORM: "погреб"},
|
||||
{ORTH: "к.", NORM: "корпус"},
|
||||
{ORTH: "ОНС", NORM: "объект незавершенного строительства"},
|
||||
{ORTH: "оф.", NORM: "офис"},
|
||||
{ORTH: "пав.", NORM: "павильон"},
|
||||
{ORTH: "помещ.", NORM: "помещение"},
|
||||
{ORTH: "раб.уч.", NORM: "рабочий участок"},
|
||||
{ORTH: "скл.", NORM: "склад"},
|
||||
{ORTH: "соор.", NORM: "сооружение"},
|
||||
{ORTH: "стр.", NORM: "строение"},
|
||||
{ORTH: "торг.зал", NORM: "торговый зал"},
|
||||
{ORTH: "а/п", NORM: "аэропорт"},
|
||||
{ORTH: "им.", NORM: "имени"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
for abbr in [
|
||||
# Other abbreviations
|
||||
{ORTH: "тыс.руб.", NORM: "тысяч рублей"},
|
||||
{ORTH: "тыс.", NORM: "тысяч"},
|
||||
{ORTH: "руб.", NORM: "рубль"},
|
||||
{ORTH: "долл.", NORM: "доллар"},
|
||||
{ORTH: "прим.", NORM: "примечание"},
|
||||
{ORTH: "прим.ред.", NORM: "примечание редакции"},
|
||||
{ORTH: "см. также", NORM: "смотри также"},
|
||||
{ORTH: "кв.м.", NORM: "квадратный метр"},
|
||||
{ORTH: "м2", NORM: "квадратный метр"},
|
||||
{ORTH: "б/у", NORM: "бывший в употреблении"},
|
||||
{ORTH: "сокр.", NORM: "сокращение"},
|
||||
{ORTH: "чел.", NORM: "человек"},
|
||||
{ORTH: "б.п.", NORM: "базисный пункт"},
|
||||
]:
|
||||
_exc[abbr[ORTH]] = [abbr]
|
||||
|
||||
|
||||
TOKENIZER_EXCEPTIONS = update_exc(BASE_EXCEPTIONS, _exc)
|
||||
|
|
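Reviewer note: every abbreviation group in the file above is registered with the same `_exc[abbr[ORTH]] = [abbr]` loop, so a later entry with the same `ORTH` silently replaces an earlier one (for example `"д."` appears once as `деревня` and once as `дом`). The sketch below illustrates that behaviour in isolation. It is a simplified stand-in, not spaCy code: `ORTH` and `NORM` are plain strings here instead of the `spacy.symbols` constants, and `build_exceptions` is a hypothetical helper.

```python
# Stand-ins for spacy.symbols.ORTH / NORM, so the sketch runs without spaCy.
ORTH, NORM = "ORTH", "NORM"


def build_exceptions(abbrevs):
    """Map each surface form to a one-token exception, as the loops in the diff do."""
    exc = {}
    for abbr in abbrevs:
        # Same assignment as in the diff: the key is the surface form,
        # the value is a single-token analysis carrying the normalized form.
        exc[abbr[ORTH]] = [abbr]
    return exc


exc = build_exceptions(
    [
        {ORTH: "д.", NORM: "деревня"},
        {ORTH: "ул.", NORM: "улица"},
        {ORTH: "д.", NORM: "дом"},  # same ORTH as the first entry
    ]
)
```

After this runs, `exc` holds two keys and `"д."` normalizes to `дом`, because the last assignment wins. If both readings matter, the ambiguity would have to be resolved elsewhere, since one surface form can only carry one `NORM` in this scheme.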
18
spacy/lang/sl/examples.py
Normal file
|
@ -0,0 +1,18 @@
|
|||
"""
|
||||
Example sentences to test spaCy and its language models.
|
||||
|
||||
>>> from spacy.lang.sl.examples import sentences
|
||||
>>> docs = nlp.pipe(sentences)
|
||||
"""
|
||||
|
||||
|
||||
sentences = [
|
||||
"Apple načrtuje nakup britanskega startupa za 1 bilijon dolarjev",
|
||||
"France Prešeren je umrl 8. februarja 1849 v Kranju",
|
||||
"Staro ljubljansko letališče Moste bo obnovila družba BTC",
|
||||
"London je največje mesto v Združenem kraljestvu.",
|
||||
"Kje se skrivaš?",
|
||||
"Kdo je predsednik Francije?",
|
||||
"Katero je glavno mesto Združenih držav Amerike?",
|
||||
"Kdaj je bil rojen Milan Kučan?",
|
||||
]
|
|
@ -53,7 +53,7 @@ _ordinal_words = [
|
|||
"doksanıncı",
|
||||
"yüzüncü",
|
||||
"bininci",
|
||||
"mliyonuncu",
|
||||
"milyonuncu",
|
||||
"milyarıncı",
|
||||
"trilyonuncu",
|
||||
"katrilyonuncu",
|
||||
|
|
|
@ -2,22 +2,29 @@ from ...attrs import LIKE_NUM
|
|||
|
||||
|
||||
_num_words = [
|
||||
"không",
|
||||
"một",
|
||||
"hai",
|
||||
"ba",
|
||||
"bốn",
|
||||
"năm",
|
||||
"sáu",
|
||||
"bảy",
|
||||
"bẩy",
|
||||
"tám",
|
||||
"chín",
|
||||
"mười",
|
||||
"chục",
|
||||
"trăm",
|
||||
"nghìn",
|
||||
"tỷ",
|
||||
"không", # Zero
|
||||
"một", # One
|
||||
"mốt", # Also one, irreplacable in niché cases for unit digit such as "51"="năm mươi mốt"
|
||||
"hai", # Two
|
||||
"ba", # Three
|
||||
"bốn", # Four
|
||||
"tư", # Also four, used in certain cases for unit digit such as "54"="năm mươi tư"
|
||||
"năm", # Five
|
||||
"lăm", # Also five, irreplacable in niché cases for unit digit such as "55"="năm mươi lăm"
|
||||
"sáu", # Six
|
||||
"bảy", # Seven
|
||||
"bẩy", # Also seven, old fashioned
|
||||
"tám", # Eight
|
||||
"chín", # Nine
|
||||
"mười", # Ten
|
||||
"chục", # Also ten, used for counting in tens such as "20 eggs"="hai chục trứng"
|
||||
"trăm", # Hundred
|
||||
"nghìn", # Thousand
|
||||
"ngàn", # Also thousand, used in the south
|
||||
"vạn", # Ten thousand
|
||||
"triệu", # Million
|
||||
"tỷ", # Billion
|
||||
"tỉ", # Also billion, used in combinatorics such as "tỉ_phú"="billionaire"
|
||||
]
|
||||
|
||||
|
||||
|
|
|
@ -131,7 +131,7 @@ class Language:
|
|||
self,
|
||||
vocab: Union[Vocab, bool] = True,
|
||||
*,
|
||||
max_length: int = 10 ** 6,
|
||||
max_length: int = 10**6,
|
||||
meta: Dict[str, Any] = {},
|
||||
create_tokenizer: Optional[Callable[["Language"], Callable[[str], Doc]]] = None,
|
||||
batch_size: int = 1000,
|
||||
|
@ -1222,8 +1222,9 @@ class Language:
|
|||
component_cfg = {}
|
||||
grads = {}
|
||||
|
||||
def get_grads(W, dW, key=None):
|
||||
def get_grads(key, W, dW):
|
||||
grads[key] = (W, dW)
|
||||
return W, dW
|
||||
|
||||
get_grads.learn_rate = sgd.learn_rate # type: ignore[attr-defined, union-attr]
|
||||
get_grads.b1 = sgd.b1 # type: ignore[attr-defined, union-attr]
|
||||
|
@ -1236,7 +1237,7 @@ class Language:
|
|||
examples, sgd=get_grads, losses=losses, **component_cfg.get(name, {})
|
||||
)
|
||||
for key, (W, dW) in grads.items():
|
||||
sgd(W, dW, key=key) # type: ignore[call-arg, misc]
|
||||
sgd(key, W, dW) # type: ignore[call-arg, misc]
|
||||
return losses
|
||||
|
||||
def begin_training(
|
||||
|
|
|
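Reviewer note: the `get_grads` change above moves `key` from a keyword argument to the first positional argument, matching the new `sgd(key, W, dW)` call. The pattern is a proxy optimizer that records gradients during the component updates so the real optimizer can be applied once per key afterwards. Below is a minimal pure-Python sketch of that pattern. `make_grad_collector` and `plain_sgd` are hypothetical names, weights are plain lists rather than arrays, and the `(key, W, dW)` call order is taken from the hunk rather than from any optimizer documentation.

```python
from typing import Callable, Dict, List, Tuple

Weights = List[float]


def make_grad_collector() -> Tuple[Dict[str, Tuple[Weights, Weights]], Callable]:
    """Return a dict plus a proxy optimizer that records (W, dW) per key."""
    grads: Dict[str, Tuple[Weights, Weights]] = {}

    def get_grads(key: str, W: Weights, dW: Weights) -> Tuple[Weights, Weights]:
        # Record instead of updating; hand the weights back unchanged.
        grads[key] = (W, dW)
        return W, dW

    return grads, get_grads


def plain_sgd(key: str, W: Weights, dW: Weights, learn_rate: float = 0.5):
    # Hypothetical stand-in for the real optimizer: one gradient step,
    # returning the new weights and a zeroed gradient.
    return [w - learn_rate * g for w, g in zip(W, dW)], [0.0 for _ in dW]


grads, get_grads = make_grad_collector()
# Components would call the proxy during their update step.
get_grads("tagger", [1.0, 2.0], [10.0, -10.0])
get_grads("parser", [0.5], [5.0])
# Afterwards the real optimizer is applied once per recorded key.
updated = {key: plain_sgd(key, W, dW)[0] for key, (W, dW) in grads.items()}
```

Passing `key` positionally in both the proxy and the final call keeps the two signatures interchangeable, which is what lets the proxy stand in for the optimizer during the component updates.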
@ -244,6 +244,10 @@ cdef class Matcher:
|
|||
pipe = "parser"
|
||||
error_msg = Errors.E155.format(pipe=pipe, attr=self.vocab.strings.as_string(attr))
|
||||
raise ValueError(error_msg)
|
||||
|
||||
if self.patterns.empty():
|
||||
matches = []
|
||||
else:
|
||||
matches = find_matches(&self.patterns[0], self.patterns.size(), doclike, length,
|
||||
extensions=self._extensions, predicates=self._extra_predicates, with_alignments=with_alignments)
|
||||
final_matches = []
|
||||
|
@ -686,18 +690,14 @@ cdef int8_t get_is_match(PatternStateC state,
|
|||
return True
|
||||
|
||||
|
||||
cdef int8_t get_is_final(PatternStateC state) nogil:
|
||||
cdef inline int8_t get_is_final(PatternStateC state) nogil:
|
||||
if state.pattern[1].quantifier == FINAL_ID:
|
||||
id_attr = state.pattern[1].attrs[0]
|
||||
if id_attr.attr != ID:
|
||||
with gil:
|
||||
raise ValueError(Errors.E074.format(attr=ID, bad_attr=id_attr.attr))
|
||||
return 1
|
||||
else:
|
||||
return 0
|
||||
|
||||
|
||||
cdef int8_t get_quantifier(PatternStateC state) nogil:
|
||||
cdef inline int8_t get_quantifier(PatternStateC state) nogil:
|
||||
return state.pattern.quantifier
|
||||
|
||||
|
||||
|
|
|
@ -14,7 +14,7 @@ class PhraseMatcher:
|
|||
def add(
|
||||
self,
|
||||
key: str,
|
||||
docs: List[List[Dict[str, Any]]],
|
||||
docs: List[Doc],
|
||||
*,
|
||||
on_match: Optional[
|
||||
Callable[[Matcher, Doc, int, List[Tuple[Any, ...]]], Any]
|
||||
|
|
|
@ -63,4 +63,4 @@ def _get_span_indices(ops, spans: Ragged, lengths: Ints1d) -> Ints1d:
|
|||
|
||||
|
||||
def _ensure_cpu(spans: Ragged, lengths: Ints1d) -> Tuple[Ragged, Ints1d]:
|
||||
return (Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths))
|
||||
return Ragged(to_numpy(spans.dataXd), to_numpy(spans.lengths)), to_numpy(lengths)
|
||||
|
|
|
@ -1,34 +1,82 @@
|
|||
from pathlib import Path
|
||||
from typing import Optional, Callable, Iterable, List
|
||||
from typing import Optional, Callable, Iterable, List, Tuple
|
||||
from thinc.types import Floats2d
|
||||
from thinc.api import chain, clone, list2ragged, reduce_mean, residual
|
||||
from thinc.api import Model, Maxout, Linear
|
||||
from thinc.api import Model, Maxout, Linear, noop, tuplify, Ragged
|
||||
|
||||
from ...util import registry
|
||||
from ...kb import KnowledgeBase, Candidate, get_candidates
|
||||
from ...vocab import Vocab
|
||||
from ...tokens import Span, Doc
|
||||
from ..extract_spans import extract_spans
|
||||
from ...errors import Errors
|
||||
|
||||
|
||||
@registry.architectures("spacy.EntityLinker.v1")
|
||||
@registry.architectures("spacy.EntityLinker.v2")
|
||||
def build_nel_encoder(
|
||||
tok2vec: Model, nO: Optional[int] = None
|
||||
) -> Model[List[Doc], Floats2d]:
|
||||
with Model.define_operators({">>": chain, "**": clone}):
|
||||
with Model.define_operators({">>": chain, "&": tuplify}):
|
||||
token_width = tok2vec.maybe_get_dim("nO")
|
||||
output_layer = Linear(nO=nO, nI=token_width)
|
||||
model = (
|
||||
tok2vec
|
||||
>> list2ragged()
|
||||
((tok2vec >> list2ragged()) & build_span_maker())
|
||||
>> extract_spans()
|
||||
>> reduce_mean()
|
||||
>> residual(Maxout(nO=token_width, nI=token_width, nP=2, dropout=0.0)) # type: ignore[arg-type]
|
||||
>> output_layer
|
||||
)
|
||||
model.set_ref("output_layer", output_layer)
|
||||
model.set_ref("tok2vec", tok2vec)
|
||||
# flag to show this isn't legacy
|
||||
model.attrs["include_span_maker"] = True
|
||||
return model
|
||||
|
||||
|
||||
def build_span_maker(n_sents: int = 0) -> Model:
|
||||
model: Model = Model("span_maker", forward=span_maker_forward)
|
||||
model.attrs["n_sents"] = n_sents
|
||||
return model
|
||||
|
||||
|
||||
def span_maker_forward(model, docs: List[Doc], is_train) -> Tuple[Ragged, Callable]:
|
||||
ops = model.ops
|
||||
n_sents = model.attrs["n_sents"]
|
||||
candidates = []
|
||||
for doc in docs:
|
||||
cands = []
|
||||
try:
|
||||
sentences = [s for s in doc.sents]
|
||||
except ValueError:
|
||||
# no sentence info, normal in initialization
|
||||
for tok in doc:
|
||||
tok.is_sent_start = tok.i == 0
|
||||
sentences = [doc[:]]
|
||||
for ent in doc.ents:
|
||||
try:
|
||||
# find the sentence in the list of sentences.
|
||||
sent_index = sentences.index(ent.sent)
|
||||
except AttributeError:
|
||||
# Catch the exception when ent.sent is None and provide a user-friendly warning
|
||||
raise RuntimeError(Errors.E030) from None
|
||||
# get n previous sentences, if there are any
|
||||
start_sentence = max(0, sent_index - n_sents)
|
||||
# get n posterior sentences, or as many < n as there are
|
||||
end_sentence = min(len(sentences) - 1, sent_index + n_sents)
|
||||
# get token positions
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
# save positions for extraction
|
||||
cands.append((start_token, end_token))
|
||||
|
||||
candidates.append(ops.asarray2i(cands))
|
||||
candlens = ops.asarray1i([len(cands) for cands in candidates])
|
||||
candidates = ops.xp.concatenate(candidates)
|
||||
outputs = Ragged(candidates, candlens)
|
||||
# because this is just rearranging docs, the backprop does nothing
|
||||
return outputs, lambda x: []
|
||||
|
||||
|
||||
@registry.misc("spacy.KBFromFile.v1")
|
||||
def load_kb(kb_path: Path) -> Callable[[Vocab], KnowledgeBase]:
|
||||
def kb_from_file(vocab):
|
||||
|
|
|
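The context-window arithmetic in `span_maker_forward` can be tried out in isolation. The following is a minimal sketch, not the component's code: `context_window` and the `(start, end)` tuples standing in for sentence spans are hypothetical names introduced only for illustration.

```python
from typing import List, Tuple


def context_window(
    sent_bounds: List[Tuple[int, int]], sent_index: int, n_sents: int
) -> Tuple[int, int]:
    """Return the (start_token, end_token) window around one sentence.

    sent_bounds holds one (start, end) token offset pair per sentence and
    plays the role of the Span objects used in the diff (an assumption made
    for this sketch).
    """
    # Up to n_sents previous sentences, clamped at the first sentence.
    start_sentence = max(0, sent_index - n_sents)
    # Up to n_sents following sentences, clamped at the last sentence.
    end_sentence = min(len(sent_bounds) - 1, sent_index + n_sents)
    return sent_bounds[start_sentence][0], sent_bounds[end_sentence][1]


bounds = [(0, 5), (5, 9), (9, 14)]
print(context_window(bounds, 1, 0))
print(context_window(bounds, 1, 1))
```

With `n_sents = 0`, which is the default passed by `build_span_maker`, the window is just the sentence containing the entity.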
@ -85,7 +85,7 @@ def get_characters_loss(ops, docs, prediction, nr_char):
|
|||
target = ops.asarray(to_categorical(target_ids, n_classes=256), dtype="f")
|
||||
target = target.reshape((-1, 256 * nr_char))
|
||||
diff = prediction - target
|
||||
loss = (diff ** 2).sum()
|
||||
loss = (diff**2).sum()
|
||||
d_target = diff / float(prediction.shape[0])
|
||||
return loss, d_target
|
||||
|
||||
|
|
|
@ -1,14 +1,14 @@
|
|||
from typing import Optional, List
|
||||
from thinc.api import zero_init, with_array, Softmax, chain, Model
|
||||
from thinc.api import zero_init, with_array, Softmax_v2, chain, Model
|
||||
from thinc.types import Floats2d
|
||||
|
||||
from ...util import registry
|
||||
from ...tokens import Doc
|
||||
|
||||
|
||||
@registry.architectures("spacy.Tagger.v1")
|
||||
@registry.architectures("spacy.Tagger.v2")
|
||||
def build_tagger_model(
|
||||
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None
|
||||
tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None, normalize=False
|
||||
) -> Model[List[Doc], List[Floats2d]]:
|
||||
"""Build a tagger model, using a provided token-to-vector component. The tagger
|
||||
model simply adds a linear layer with softmax activation to predict scores
|
||||
|
@ -19,7 +19,9 @@ def build_tagger_model(
|
|||
"""
|
||||
# TODO: glorot_uniform_init seems to work a bit better than zero_init here?!
|
||||
t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None
|
||||
output_layer = Softmax(nO, t2v_width, init_W=zero_init)
|
||||
output_layer = Softmax_v2(
|
||||
nO, t2v_width, init_W=zero_init, normalize_outputs=normalize
|
||||
)
|
||||
softmax = with_array(output_layer) # type: ignore
|
||||
model = chain(tok2vec, softmax)
|
||||
model.set_ref("tok2vec", tok2vec)
|
||||
|
|
|
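The new `normalize` argument is forwarded to `Softmax_v2` as `normalize_outputs`. Assuming that switching it off skips the normalization step at prediction time (a reading of the flag name, not something this diff states), the predicted tag should be unaffected, because softmax preserves the ordering of the scores. The sketch below illustrates that property with the standard library only; `softmax` and `argmax` are illustrative helpers, not thinc functions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a single row of scores."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.5, -0.3, 4.2, 0.0]
# The winning class is the same with and without normalization.
print(argmax(logits), argmax(softmax(logits)))
```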
@ -11,6 +11,7 @@ import numpy.random
|
|||
from thinc.api import Model, CupyOps, NumpyOps
|
||||
|
||||
from .. import util
|
||||
from ..errors import Errors
|
||||
from ..typedefs cimport weight_t, class_t, hash_t
|
||||
from ..pipeline._parser_internals.stateclass cimport StateClass
|
||||
|
||||
|
@ -411,7 +412,7 @@ cdef class precompute_hiddens:
|
|||
elif name == "nO":
|
||||
return self.nO
|
||||
else:
|
||||
raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP")
|
||||
raise ValueError(Errors.E1033.format(name=name))
|
||||
|
||||
def set_dim(self, name, value):
|
||||
if name == "nF":
|
||||
|
@ -421,7 +422,7 @@ cdef class precompute_hiddens:
|
|||
elif name == "nO":
|
||||
self.nO = value
|
||||
else:
|
||||
raise ValueError(f"Dimension {name} invalid -- only nO, nF, nP")
|
||||
raise ValueError(Errors.E1033.format(name=name))
|
||||
|
||||
def __call__(self, X, bint is_train):
|
||||
if is_train:
|
||||
|
|
|
@ -1,5 +1,6 @@
|
|||
from .attributeruler import AttributeRuler
|
||||
from .dep_parser import DependencyParser
|
||||
from .edit_tree_lemmatizer import EditTreeLemmatizer
|
||||
from .entity_linker import EntityLinker
|
||||
from .ner import EntityRecognizer
|
||||
from .entityruler import EntityRuler
|
||||
|
|
0
spacy/pipeline/_edit_tree_internals/__init__.py
Normal file
93
spacy/pipeline/_edit_tree_internals/edit_trees.pxd
Normal file
|
@ -0,0 +1,93 @@
|
|||
from libc.stdint cimport uint32_t, uint64_t
|
||||
from libcpp.unordered_map cimport unordered_map
|
||||
from libcpp.vector cimport vector
|
||||
|
||||
from ...typedefs cimport attr_t, hash_t, len_t
|
||||
from ...strings cimport StringStore
|
||||
|
||||
cdef extern from "<algorithm>" namespace "std" nogil:
|
||||
void swap[T](T& a, T& b) except + # Only available in Cython 3.
|
||||
|
||||
# An edit tree (Müller et al., 2015) is a tree structure that consists of
|
||||
# edit operations. The two types of operations are string matches
|
||||
# and string substitutions. Given an input string s and an output string t,
|
||||
# substitution and match nodes should be interpreted as follows:
|
||||
#
|
||||
# * Substitution node: consists of an original string and substitute string.
|
||||
# If s matches the original string, then t is the substitute. Otherwise,
|
||||
# the node does not apply.
|
||||
# * Match node: consists of a prefix length, suffix length, prefix edit tree,
|
||||
# and suffix edit tree. If s is composed of a prefix, middle part, and suffix
|
||||
# with the given suffix and prefix lengths, then t is the concatenation
|
||||
# prefix_tree(prefix) + middle + suffix_tree(suffix).
|
||||
#
|
||||
# For efficiency, we represent strings in substitution nodes as integers, with
|
||||
# the actual strings stored in a StringStore. Subtrees in match nodes are stored
|
||||
# as tree identifiers (rather than pointers) to simplify serialization.
|
||||
|
||||
cdef uint32_t NULL_TREE_ID
|
||||
|
||||
cdef struct MatchNodeC:
|
||||
len_t prefix_len
|
||||
len_t suffix_len
|
||||
uint32_t prefix_tree
|
||||
uint32_t suffix_tree
|
||||
|
||||
cdef struct SubstNodeC:
|
||||
attr_t orig
|
||||
attr_t subst
|
||||
|
||||
cdef union NodeC:
|
||||
MatchNodeC match_node
|
||||
SubstNodeC subst_node
|
||||
|
||||
cdef struct EditTreeC:
|
||||
bint is_match_node
|
||||
NodeC inner
|
||||
|
||||
cdef inline EditTreeC edittree_new_match(len_t prefix_len, len_t suffix_len,
|
||||
uint32_t prefix_tree, uint32_t suffix_tree):
|
||||
cdef MatchNodeC match_node = MatchNodeC(prefix_len=prefix_len,
|
||||
suffix_len=suffix_len, prefix_tree=prefix_tree,
|
||||
suffix_tree=suffix_tree)
|
||||
cdef NodeC inner = NodeC(match_node=match_node)
|
||||
return EditTreeC(is_match_node=True, inner=inner)
|
||||
|
||||
cdef inline EditTreeC edittree_new_subst(attr_t orig, attr_t subst):
|
||||
cdef EditTreeC node
|
||||
cdef SubstNodeC subst_node = SubstNodeC(orig=orig, subst=subst)
|
||||
cdef NodeC inner = NodeC(subst_node=subst_node)
|
||||
return EditTreeC(is_match_node=False, inner=inner)
|
||||
|
||||
cdef inline uint64_t edittree_hash(EditTreeC tree):
|
||||
cdef MatchNodeC match_node
|
||||
cdef SubstNodeC subst_node
|
||||
|
||||
if tree.is_match_node:
|
||||
match_node = tree.inner.match_node
|
||||
return hash((match_node.prefix_len, match_node.suffix_len, match_node.prefix_tree, match_node.suffix_tree))
|
||||
else:
|
||||
subst_node = tree.inner.subst_node
|
||||
return hash((subst_node.orig, subst_node.subst))
|
||||
|
||||
cdef struct LCS:
|
||||
int source_begin
|
||||
int source_end
|
||||
int target_begin
|
||||
int target_end
|
||||
|
||||
cdef inline bint lcs_is_empty(LCS lcs):
|
||||
return lcs.source_begin == 0 and lcs.source_end == 0 and lcs.target_begin == 0 and lcs.target_end == 0
|
||||
|
||||
cdef class EditTrees:
|
||||
cdef vector[EditTreeC] trees
|
||||
cdef unordered_map[hash_t, uint32_t] map
|
||||
cdef StringStore strings
|
||||
|
||||
cpdef uint32_t add(self, str form, str lemma)
|
||||
cpdef str apply(self, uint32_t tree_id, str form)
|
||||
cpdef unicode tree_to_str(self, uint32_t tree_id)
|
||||
|
||||
cdef uint32_t _add(self, str form, str lemma)
|
||||
cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces)
|
||||
cdef uint32_t _tree_id(self, EditTreeC tree)
|
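The edit tree description in the comment above can be made concrete with a pure-Python sketch. It assumes nested dataclasses instead of the integer tree identifiers and `StringStore` keys used by the Cython structs; `Subst`, `Match`, `build`, `apply` and `longest_common_substring` are names chosen for this illustration and are not part of spaCy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Subst:
    orig: str
    subst: str


@dataclass(frozen=True)
class Match:
    prefix_len: int
    suffix_len: int
    prefix_tree: Optional["Tree"]
    suffix_tree: Optional["Tree"]


Tree = Union[Subst, Match]


def longest_common_substring(source: str, target: str) -> Tuple[int, int, int, int]:
    """Return (source_begin, source_end, target_begin, target_end)."""
    best = (0, 0, 0, 0)
    longest = 0
    prev = [0] * len(target)
    for i, s_ch in enumerate(source):
        cur = [0] * len(target)
        for j, t_ch in enumerate(target):
            if s_ch == t_ch:
                # Extend the alignment that ended one character earlier.
                cur[j] = 1 if i == 0 or j == 0 else prev[j - 1] + 1
                if cur[j] > longest:
                    longest = cur[j]
                    best = (i - longest + 1, i + 1, j - longest + 1, j + 1)
        prev = cur
    return best


def build(form: str, lemma: str) -> Tree:
    """Build an edit tree that rewrites form into lemma."""
    if not form and not lemma:
        return Match(0, 0, None, None)
    sb, se, tb, te = longest_common_substring(form, lemma)
    if sb == se:  # nothing in common: substitute the whole string
        return Subst(form, lemma)
    prefix = build(form[:sb], lemma[:tb]) if sb != 0 or tb != 0 else None
    has_suffix = se != len(form) or te != len(lemma)
    suffix = build(form[se:], lemma[te:]) if has_suffix else None
    return Match(sb, len(form) - se, prefix, suffix)


def apply(tree: Tree, form: str) -> Optional[str]:
    """Apply a tree to a form; return None if the tree does not fit."""
    if isinstance(tree, Subst):
        return tree.subst if form == tree.orig else None
    if tree.prefix_len + tree.suffix_len > len(form):
        return None
    suffix_start = len(form) - tree.suffix_len
    pieces = []
    if tree.prefix_tree is not None:
        piece = apply(tree.prefix_tree, form[: tree.prefix_len])
        if piece is None:
            return None
        pieces.append(piece)
    pieces.append(form[tree.prefix_len : suffix_start])
    if tree.suffix_tree is not None:
        piece = apply(tree.suffix_tree, form[suffix_start:])
        if piece is None:
            return None
        pieces.append(piece)
    return "".join(pieces)


tree = build("gegooid", "gooien")
print(tree)
print(apply(tree, "gegooid"), apply(tree, "lopen"))
```

Because the match node stores only lengths, the same tree also rewrites other forms that share the prefix and suffix, which is what lets a classifier predict a tree per token instead of a lemma string.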
305
spacy/pipeline/_edit_tree_internals/edit_trees.pyx
Normal file
|
@ -0,0 +1,305 @@
|
|||
# cython: infer_types=True, binding=True
|
||||
from cython.operator cimport dereference as deref
|
||||
from libc.stdint cimport uint32_t
|
||||
from libc.stdint cimport UINT32_MAX
|
||||
from libc.string cimport memset
|
||||
from libcpp.pair cimport pair
|
||||
from libcpp.vector cimport vector
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
from ...typedefs cimport hash_t
|
||||
|
||||
from ... import util
|
||||
from ...errors import Errors
|
||||
from ...strings import StringStore
|
||||
from .schemas import validate_edit_tree
|
||||
|
||||
|
||||
NULL_TREE_ID = UINT32_MAX
|
||||
|
||||
cdef LCS find_lcs(str source, str target):
|
||||
"""
|
||||
Find the longest common contiguous substring (LCS) of two strings. If there are
|
||||
multiple LCSes, only one of them is returned.
|
||||
|
||||
source (str): The first string.
|
||||
target (str): The second string.
|
||||
RETURNS (LCS): The source and target spans of the longest common substring.
|
||||
"""
|
||||
cdef Py_ssize_t source_len = len(source)
|
||||
cdef Py_ssize_t target_len = len(target)
|
||||
cdef size_t longest_align = 0;
|
||||
cdef int source_idx, target_idx
|
||||
cdef LCS lcs
|
||||
cdef Py_UCS4 source_cp, target_cp
|
||||
|
||||
memset(&lcs, 0, sizeof(lcs))
|
||||
|
||||
cdef vector[size_t] prev_aligns = vector[size_t](target_len);
|
||||
cdef vector[size_t] cur_aligns = vector[size_t](target_len);
|
||||
|
||||
for (source_idx, source_cp) in enumerate(source):
|
||||
for (target_idx, target_cp) in enumerate(target):
|
||||
if source_cp == target_cp:
|
||||
if source_idx == 0 or target_idx == 0:
|
||||
cur_aligns[target_idx] = 1
|
||||
else:
|
||||
cur_aligns[target_idx] = prev_aligns[target_idx - 1] + 1
|
||||
|
||||
# Check if this is the longest alignment and replace previous
|
||||
# best alignment when this is the case.
|
||||
if cur_aligns[target_idx] > longest_align:
|
||||
longest_align = cur_aligns[target_idx]
|
||||
lcs.source_begin = source_idx - longest_align + 1
|
||||
lcs.source_end = source_idx + 1
|
||||
lcs.target_begin = target_idx - longest_align + 1
|
||||
lcs.target_end = target_idx + 1
|
||||
else:
|
||||
# No match, we start with a zero-length alignment.
|
||||
cur_aligns[target_idx] = 0
|
||||
swap(prev_aligns, cur_aligns)
|
||||
|
||||
return lcs
|
||||
|
||||
cdef class EditTrees:
|
||||
"""Container for constructing and storing edit trees."""
|
||||
def __init__(self, strings: StringStore):
|
||||
"""Create a container for edit trees.
|
||||
|
||||
strings (StringStore): the string store to use."""
|
||||
self.strings = strings
|
||||
|
||||
cpdef uint32_t add(self, str form, str lemma):
|
||||
"""Add an edit tree that rewrites the given string into the given lemma.
|
||||
|
||||
RETURNS (int): identifier of the edit tree in the container.
|
||||
"""
|
||||
# Treat two empty strings as a special case. Generating an edit
|
||||
# tree for identical strings results in a match node. However,
|
||||
# since two empty strings have a zero-length LCS, a substitution
|
||||
# node would be created. Since we do not want to clutter the
|
||||
# recursive tree construction with logic for this case, handle
|
||||
# it in this wrapper method.
|
||||
if len(form) == 0 and len(lemma) == 0:
|
||||
tree = edittree_new_match(0, 0, NULL_TREE_ID, NULL_TREE_ID)
|
||||
return self._tree_id(tree)
|
||||
|
||||
return self._add(form, lemma)
|
||||
|
||||
cdef uint32_t _add(self, str form, str lemma):
|
||||
cdef LCS lcs = find_lcs(form, lemma)
|
||||
|
||||
cdef EditTreeC tree
|
||||
cdef uint32_t tree_id, prefix_tree, suffix_tree
|
||||
if lcs_is_empty(lcs):
|
||||
tree = edittree_new_subst(self.strings.add(form), self.strings.add(lemma))
|
||||
else:
|
||||
# If we have a non-empty LCS, such as "gooi" in "ge[gooi]d" and "[gooi]en",
|
||||
# create edit trees for the prefix pair ("ge"/"") and the suffix pair ("d"/"en").
|
||||
prefix_tree = NULL_TREE_ID
|
||||
if lcs.source_begin != 0 or lcs.target_begin != 0:
|
||||
prefix_tree = self.add(form[:lcs.source_begin], lemma[:lcs.target_begin])
|
||||
|
||||
suffix_tree = NULL_TREE_ID
|
||||
if lcs.source_end != len(form) or lcs.target_end != len(lemma):
|
||||
suffix_tree = self.add(form[lcs.source_end:], lemma[lcs.target_end:])
|
||||
|
||||
tree = edittree_new_match(lcs.source_begin, len(form) - lcs.source_end, prefix_tree, suffix_tree)
|
||||
|
||||
return self._tree_id(tree)
|
||||
|
||||
cdef uint32_t _tree_id(self, EditTreeC tree):
|
||||
# If this tree has been constructed before, return its identifier.
|
||||
cdef hash_t hash = edittree_hash(tree)
|
||||
cdef unordered_map[hash_t, uint32_t].iterator iter = self.map.find(hash)
|
||||
if iter != self.map.end():
|
||||
return deref(iter).second
|
||||
|
||||
# The tree hasn't been seen before, store it.
|
||||
cdef uint32_t tree_id = self.trees.size()
|
||||
self.trees.push_back(tree)
|
||||
self.map.insert(pair[hash_t, uint32_t](hash, tree_id))
|
||||
|
||||
return tree_id
|
||||
|
||||
cpdef str apply(self, uint32_t tree_id, str form):
|
||||
"""Apply an edit tree to a form.
|
||||
|
||||
tree_id (uint32_t): the identifier of the edit tree to apply.
|
||||
form (str): the form to apply the edit tree to.
|
||||
RETURNS (str): the transformed form or None if the edit tree
|
||||
could not be applied to the form.
|
||||
"""
|
||||
if tree_id >= self.trees.size():
|
||||
raise IndexError(Errors.E1030)
|
||||
|
||||
lemma_pieces = []
|
||||
try:
|
||||
self._apply(tree_id, form, lemma_pieces)
|
||||
except ValueError:
|
||||
return None
|
||||
return "".join(lemma_pieces)
|
||||
|
||||
cdef _apply(self, uint32_t tree_id, str form_part, list lemma_pieces):
|
||||
"""Recursively apply an edit tree to a form, adding pieces to
|
||||
the lemma_pieces list."""
|
||||
assert tree_id < self.trees.size()
|
||||
|
||||
cdef EditTreeC tree = self.trees[tree_id]
|
||||
cdef MatchNodeC match_node
|
||||
cdef int suffix_start
|
||||
|
||||
if tree.is_match_node:
|
||||
match_node = tree.inner.match_node
|
||||
|
||||
if match_node.prefix_len + match_node.suffix_len > len(form_part):
|
||||
raise ValueError(Errors.E1029)
|
||||
|
||||
suffix_start = len(form_part) - match_node.suffix_len
|
||||
|
||||
if match_node.prefix_tree != NULL_TREE_ID:
|
||||
self._apply(match_node.prefix_tree, form_part[:match_node.prefix_len], lemma_pieces)
|
||||
|
||||
lemma_pieces.append(form_part[match_node.prefix_len:suffix_start])
|
||||
|
||||
if match_node.suffix_tree != NULL_TREE_ID:
|
||||
self._apply(match_node.suffix_tree, form_part[suffix_start:], lemma_pieces)
|
||||
else:
|
||||
if form_part == self.strings[tree.inner.subst_node.orig]:
|
||||
lemma_pieces.append(self.strings[tree.inner.subst_node.subst])
|
||||
else:
|
||||
raise ValueError(Errors.E1029)
|
||||
|
||||
cpdef unicode tree_to_str(self, uint32_t tree_id):
|
||||
"""Return the tree as a string. The tree string is formatted
|
||||
like an S-expression. This is primarily useful for debugging. Match
|
||||
nodes have the following format:
|
||||
|
||||
(m prefix_len suffix_len prefix_tree suffix_tree)
|
||||
|
||||
Substitution nodes have the following format:
|
||||
|
||||
(s original substitute)
|
||||
|
||||
tree_id (uint32_t): the identifier of the edit tree.
|
||||
RETURNS (str): the tree as an S-expression.
|
||||
"""
|
||||
|
||||
if tree_id >= self.trees.size():
|
||||
raise IndexError(Errors.E1030)
|
||||
|
||||
cdef EditTreeC tree = self.trees[tree_id]
|
||||
cdef SubstNodeC subst_node
|
||||
|
||||
if not tree.is_match_node:
|
||||
subst_node = tree.inner.subst_node
|
||||
return f"(s '{self.strings[subst_node.orig]}' '{self.strings[subst_node.subst]}')"
|
||||
|
||||
cdef MatchNodeC match_node = tree.inner.match_node
|
||||
|
||||
prefix_tree = "()"
|
||||
if match_node.prefix_tree != NULL_TREE_ID:
|
||||
prefix_tree = self.tree_to_str(match_node.prefix_tree)
|
||||
|
||||
suffix_tree = "()"
|
||||
if match_node.suffix_tree != NULL_TREE_ID:
|
||||
suffix_tree = self.tree_to_str(match_node.suffix_tree)
|
||||
|
||||
return f"(m {match_node.prefix_len} {match_node.suffix_len} {prefix_tree} {suffix_tree})"
|
||||
|
||||
def from_json(self, trees: list) -> "EditTrees":
|
||||
self.trees.clear()
|
||||
|
||||
for tree in trees:
|
||||
tree = _dict2tree(tree)
|
||||
self.trees.push_back(tree)
|
||||
|
||||
self._rebuild_tree_map()
|
||||
|
||||
def from_bytes(self, bytes_data: bytes, *) -> "EditTrees":
|
||||
def deserialize_trees(tree_dicts):
|
||||
cdef EditTreeC c_tree
|
||||
for tree_dict in tree_dicts:
|
||||
c_tree = _dict2tree(tree_dict)
|
||||
self.trees.push_back(c_tree)
|
||||
|
||||
deserializers = {}
|
||||
deserializers["trees"] = lambda n: deserialize_trees(n)
|
||||
util.from_bytes(bytes_data, deserializers, [])
|
||||
|
||||
self._rebuild_tree_map()
|
||||
|
||||
return self
|
||||
|
||||
def to_bytes(self, **kwargs) -> bytes:
|
||||
tree_dicts = []
|
||||
for tree in self.trees:
|
||||
tree = _tree2dict(tree)
|
||||
tree_dicts.append(tree)
|
||||
|
||||
serializers = {}
|
||||
serializers["trees"] = lambda: tree_dicts
|
||||
|
||||
return util.to_bytes(serializers, [])
|
||||
|
||||
def to_disk(self, path, **kwargs) -> "EditTrees":
|
||||
path = util.ensure_path(path)
|
||||
with path.open("wb") as file_:
|
||||
file_.write(self.to_bytes())
|
||||
|
||||
def from_disk(self, path, **kwargs) -> "EditTrees":
|
||||
path = util.ensure_path(path)
|
||||
if path.exists():
|
||||
with path.open("rb") as file_:
|
||||
data = file_.read()
|
||||
return self.from_bytes(data)
|
||||
|
||||
return self
|
||||
|
||||
def __getitem__(self, idx):
|
||||
return _tree2dict(self.trees[idx])
|
||||
|
||||
def __len__(self):
|
||||
return self.trees.size()
|
||||
|
||||
def _rebuild_tree_map(self):
|
||||
"""Rebuild the tree hash -> tree id mapping"""
|
||||
cdef EditTreeC c_tree
|
||||
cdef uint32_t tree_id
|
||||
cdef hash_t tree_hash
|
||||
|
||||
self.map.clear()
|
||||
|
||||
for tree_id in range(self.trees.size()):
|
||||
c_tree = self.trees[tree_id]
|
||||
tree_hash = edittree_hash(c_tree)
|
||||
self.map.insert(pair[hash_t, uint32_t](tree_hash, tree_id))
|
||||
|
||||
def __reduce__(self):
|
||||
return (unpickle_edittrees, (self.strings, self.to_bytes()))
|
||||
|
||||
|
||||
def unpickle_edittrees(strings, trees_data):
|
||||
return EditTrees(strings).from_bytes(trees_data)
|
||||
|
||||
|
||||
def _tree2dict(tree):
|
||||
if tree["is_match_node"]:
|
||||
tree = tree["inner"]["match_node"]
|
||||
else:
|
||||
tree = tree["inner"]["subst_node"]
|
||||
return dict(tree)
|
||||
|
||||
def _dict2tree(tree):
|
||||
errors = validate_edit_tree(tree)
|
||||
if errors:
|
||||
raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
|
||||
|
||||
tree = dict(tree)
|
||||
if "prefix_len" in tree:
|
||||
tree = {"is_match_node": True, "inner": {"match_node": tree}}
|
||||
else:
|
||||
tree = {"is_match_node": False, "inner": {"subst_node": tree}}
|
||||
|
||||
return tree
|
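The container logic above (`_tree_id` for deduplication, `tree_to_str` for the S-expression rendering) can be sketched in plain Python. This is an approximation under stated assumptions: `TreeStore` is a hypothetical class, nodes are plain tuples, `None` stands in for `NULL_TREE_ID`, and the dictionary is keyed on the node itself rather than on a hash of it as in the Cython code.

```python
from typing import Dict, List, Optional, Tuple

# ("m", prefix_len, suffix_len, prefix_id, suffix_id) or ("s", orig, subst)
Node = Tuple


class TreeStore:
    """Stores each distinct node once and hands out integer identifiers."""

    def __init__(self) -> None:
        self.trees: List[Node] = []
        self.ids: Dict[Node, int] = {}

    def intern(self, node: Node) -> int:
        # If this node has been stored before, reuse its identifier.
        if node in self.ids:
            return self.ids[node]
        tree_id = len(self.trees)
        self.trees.append(node)
        self.ids[node] = tree_id
        return tree_id

    def to_str(self, tree_id: int) -> str:
        node = self.trees[tree_id]
        if node[0] == "s":
            return f"(s '{node[1]}' '{node[2]}')"
        _, prefix_len, suffix_len, prefix_id, suffix_id = node
        prefix = "()" if prefix_id is None else self.to_str(prefix_id)
        suffix = "()" if suffix_id is None else self.to_str(suffix_id)
        return f"(m {prefix_len} {suffix_len} {prefix} {suffix})"


store = TreeStore()
prefix_id = store.intern(("s", "ge", ""))
suffix_id = store.intern(("s", "d", "en"))
root_id = store.intern(("m", 2, 1, prefix_id, suffix_id))
print(store.to_str(root_id))
```

Referring to subtrees by identifier keeps every node a flat record, which is what makes the list-of-dicts serialization in `to_bytes` straightforward.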
44
spacy/pipeline/_edit_tree_internals/schemas.py
Normal file
|
@ -0,0 +1,44 @@
|
|||
from typing import Any, Dict, List, Union
|
||||
from collections import defaultdict
|
||||
from pydantic import BaseModel, Field, ValidationError
|
||||
from pydantic.types import StrictBool, StrictInt, StrictStr
|
||||
|
||||
|
||||
class MatchNodeSchema(BaseModel):
|
||||
prefix_len: StrictInt = Field(..., title="Prefix length")
|
||||
suffix_len: StrictInt = Field(..., title="Suffix length")
|
||||
prefix_tree: StrictInt = Field(..., title="Prefix tree")
|
||||
suffix_tree: StrictInt = Field(..., title="Suffix tree")
|
||||
|
||||
class Config:
|
||||
extra = "forbid"
|
||||
|
||||
|
||||
class SubstNodeSchema(BaseModel):
|
||||
orig: Union[int, StrictStr] = Field(..., title="Original substring")
|
||||
subst: Union[int, StrictStr] = Field(..., title="Replacement substring")
|
||||
|
||||
class Config:
|
||||
extra = "forbid"
|
||||
|
||||
|
||||
class EditTreeSchema(BaseModel):
|
||||
__root__: Union[MatchNodeSchema, SubstNodeSchema]
|
||||
|
||||
|
||||
def validate_edit_tree(obj: Dict[str, Any]) -> List[str]:
|
||||
"""Validate edit tree.
|
||||
|
||||
obj (Dict[str, Any]): JSON-serializable data to validate.
|
||||
RETURNS (List[str]): A list of error messages, if available.
|
||||
"""
|
||||
try:
|
||||
EditTreeSchema.parse_obj(obj)
|
||||
return []
|
||||
except ValidationError as e:
|
||||
errors = e.errors()
|
||||
data = defaultdict(list)
|
||||
for error in errors:
|
||||
err_loc = " -> ".join([str(p) for p in error.get("loc", [])])
|
||||
data[err_loc].append(error.get("msg"))
|
||||
return [f"[{loc}] {', '.join(msg)}" for loc, msg in data.items()] # type: ignore[arg-type]
|
|
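To show what the schemas accept without requiring pydantic, here is a rough standard-library approximation of the same check. It is a sketch only: `check_tree_dict` is a hypothetical helper, its messages differ from pydantic's, and it is stricter than `SubstNodeSchema`, whose plain `int` fields may coerce values.

```python
from typing import Any, Dict, List

MATCH_KEYS = {"prefix_len", "suffix_len", "prefix_tree", "suffix_tree"}
SUBST_KEYS = {"orig", "subst"}


def is_strict_int(value: Any) -> bool:
    """True for ints but not for bools, mirroring the intent of StrictInt."""
    return isinstance(value, int) and not isinstance(value, bool)


def check_tree_dict(obj: Dict[str, Any]) -> List[str]:
    """Return a list of error messages; an empty list means the dict is valid."""
    keys = set(obj)
    if keys == MATCH_KEYS:
        # Match node: all four fields must be integers.
        return [
            f"[{key}] expected an integer"
            for key in sorted(keys)
            if not is_strict_int(obj[key])
        ]
    if keys == SUBST_KEYS:
        # Substitution node: each field is a string or a string-store key.
        return [
            f"[{key}] expected an integer or a string"
            for key in sorted(keys)
            if not (is_strict_int(obj[key]) or isinstance(obj[key], str))
        ]
    # Missing or extra keys, comparable to extra = "forbid".
    return ["unexpected keys: " + ", ".join(sorted(keys))]


print(check_tree_dict({"orig": "d", "subst": "en"}))
print(check_tree_dict({"orig": "d"}))
```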
@ -218,7 +218,7 @@ def _get_aligned_sent_starts(example):
|
|||
sent_starts = [False] * len(example.x)
|
||||
seen_words = set()
|
||||
for y_sent in example.y.sents:
|
||||
x_indices = list(align[y_sent.start : y_sent.end].dataXd)
|
||||
x_indices = list(align[y_sent.start : y_sent.end])
|
||||
if any(x_idx in seen_words for x_idx in x_indices):
|
||||
# If there are any tokens in X that align across two sentences,
|
||||
# regard the sentence annotations as missing, as we can't
|
||||
|
@ -824,7 +824,7 @@ cdef class ArcEager(TransitionSystem):
|
|||
for i in range(self.n_moves):
|
||||
print(self.get_class_name(i), is_valid[i], costs[i])
|
||||
print("Gold sent starts?", is_sent_start(&gold_state, state.B(0)), is_sent_start(&gold_state, state.B(1)))
|
||||
raise ValueError("Could not find gold transition - see logs above.")
|
||||
raise ValueError(Errors.E1031)
|
||||
|
||||
def get_oracle_sequence_from_state(self, StateClass state, ArcEagerGold gold, _debug=None):
|
||||
cdef int i
|
||||
|
|
|
@ -4,6 +4,10 @@ for doing pseudo-projective parsing implementation uses the HEAD decoration
|
|||
scheme.
|
||||
"""
|
||||
from copy import copy
|
||||
from libc.limits cimport INT_MAX
|
||||
from libc.stdlib cimport abs
|
||||
from libcpp cimport bool
|
||||
from libcpp.vector cimport vector
|
||||
|
||||
from ...tokens.doc cimport Doc, set_children_from_heads
|
||||
|
||||
|
@ -41,13 +45,18 @@ def contains_cycle(heads):
|
|||
|
||||
|
||||
def is_nonproj_arc(tokenid, heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
return _is_nonproj_arc(tokenid, c_heads)
|
||||
|
||||
|
||||
cdef bool _is_nonproj_arc(int tokenid, const vector[int]& heads) nogil:
|
||||
# definition (e.g. Havelka 2007): an arc h -> d, h < d is non-projective
|
||||
# if there is a token k, h < k < d such that h is not
|
||||
# an ancestor of k. Same for h -> d, h > d
|
||||
head = heads[tokenid]
|
||||
if head == tokenid: # root arcs cannot be non-projective
|
||||
return False
|
||||
elif head is None: # unattached tokens cannot be non-projective
|
||||
elif head < 0: # unattached tokens cannot be non-projective
|
||||
return False
|
||||
|
||||
cdef int start, end
|
||||
|
@ -56,19 +65,29 @@ def is_nonproj_arc(tokenid, heads):
|
|||
else:
|
||||
start, end = (tokenid+1, head)
|
||||
for k in range(start, end):
|
||||
for ancestor in ancestors(k, heads):
|
||||
if ancestor is None: # for unattached tokens/subtrees
|
||||
break
|
||||
elif ancestor == head: # normal case: k dominated by h
|
||||
break
|
||||
if _has_head_as_ancestor(k, head, heads):
|
||||
continue
|
||||
else: # head not in ancestors: d -> h is non-projective
|
||||
return True
|
||||
return False
|
||||
|
||||
|
||||
cdef bool _has_head_as_ancestor(int tokenid, int head, const vector[int]& heads) nogil:
|
||||
ancestor = tokenid
|
||||
cnt = 0
|
||||
while cnt < heads.size():
|
||||
if heads[ancestor] == head or heads[ancestor] < 0:
|
||||
return True
|
||||
ancestor = heads[ancestor]
|
||||
cnt += 1
|
||||
|
||||
return False
|
||||
|
||||
|
||||
def is_nonproj_tree(heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
# a tree is non-projective if at least one arc is non-projective
|
||||
return any(is_nonproj_arc(word, heads) for word in range(len(heads)))
|
||||
return any(_is_nonproj_arc(word, c_heads) for word in range(len(heads)))
|
||||
|
||||
|
||||
def decompose(label):
|
||||
|
@ -98,16 +117,31 @@ def projectivize(heads, labels):
|
|||
# tree, i.e. connected and cycle-free. Returns a new pair (heads, labels)
|
||||
# which encode a projective and decorated tree.
|
||||
proj_heads = copy(heads)
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
|
||||
if smallest_np_arc is None: # this sentence is already projective
|
||||
|
||||
cdef int new_head
|
||||
cdef vector[int] c_proj_heads = _heads_to_c(proj_heads)
|
||||
cdef int smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
|
||||
if smallest_np_arc == -1: # this sentence is already projective
|
||||
return proj_heads, copy(labels)
|
||||
while smallest_np_arc is not None:
|
||||
_lift(smallest_np_arc, proj_heads)
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(proj_heads)
|
||||
while smallest_np_arc != -1:
|
||||
new_head = _lift(smallest_np_arc, proj_heads)
|
||||
c_proj_heads[smallest_np_arc] = new_head
|
||||
smallest_np_arc = _get_smallest_nonproj_arc(c_proj_heads)
|
||||
deco_labels = _decorate(heads, proj_heads, labels)
|
||||
return proj_heads, deco_labels
|
||||
|
||||
|
||||
cdef vector[int] _heads_to_c(heads):
|
||||
cdef vector[int] c_heads;
|
||||
for head in heads:
|
||||
if head is None:
|
||||
c_heads.push_back(-1)
|
||||
else:
|
||||
assert head < len(heads)
|
||||
c_heads.push_back(head)
|
||||
return c_heads
|
||||
|
||||
|
||||
cpdef deprojectivize(Doc doc):
|
||||
# Reattach arcs with decorated labels (following HEAD scheme). For each
|
||||
# decorated arc X||Y, search top-down, left-to-right, breadth-first until
|
||||
|
@ -137,27 +171,38 @@ def _decorate(heads, proj_heads, labels):
|
|||
deco_labels.append(labels[tokenid])
|
||||
return deco_labels
|
||||
|
||||
def get_smallest_nonproj_arc_slow(heads):
|
||||
cdef vector[int] c_heads = _heads_to_c(heads)
|
||||
return _get_smallest_nonproj_arc(c_heads)
|
||||
|
||||
def _get_smallest_nonproj_arc(heads):
|
||||
|
||||
cdef int _get_smallest_nonproj_arc(const vector[int]& heads) nogil:
|
||||
# return the smallest non-proj arc, or -1 if there is none (previously None),
|
||||
# where size is defined as the distance between dep and head
|
||||
# and ties are broken left to right
|
||||
smallest_size = float('inf')
|
||||
smallest_np_arc = None
|
||||
for tokenid, head in enumerate(heads):
|
||||
cdef int smallest_size = INT_MAX
|
||||
cdef int smallest_np_arc = -1
|
||||
cdef int size
|
||||
cdef int tokenid
|
||||
cdef int head
|
||||
|
||||
for tokenid in range(heads.size()):
|
||||
head = heads[tokenid]
|
||||
size = abs(tokenid-head)
|
||||
if size < smallest_size and is_nonproj_arc(tokenid, heads):
|
||||
if size < smallest_size and _is_nonproj_arc(tokenid, heads):
|
||||
smallest_size = size
|
||||
smallest_np_arc = tokenid
|
||||
return smallest_np_arc
|
||||
|
||||
|
||||
def _lift(tokenid, heads):
|
||||
cpdef int _lift(tokenid, heads):
|
||||
# reattaches a word to its grandfather
|
||||
head = heads[tokenid]
|
||||
ghead = heads[head]
|
||||
cdef int new_head = ghead if head != ghead else tokenid
|
||||
# attach to ghead if head isn't attached to root else attach to root
|
||||
heads[tokenid] = ghead if head != ghead else tokenid
|
||||
heads[tokenid] = new_head
|
||||
return new_head
|
||||
|
||||
|
||||
def _find_new_head(token, headlabel):
|
||||
|
|
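The typed rewrite above can be followed more easily in plain Python. The sketch below mirrors the integer encoding (unattached tokens become -1, "no arc found" is -1) on ordinary lists; the function names are illustrative and label decoration is left out.

```python
from typing import List, Optional


def heads_to_ints(heads: List[Optional[int]]) -> List[int]:
    """Encode unattached tokens (None) as -1."""
    return [-1 if head is None else head for head in heads]


def has_head_as_ancestor(tokenid: int, head: int, heads: List[int]) -> bool:
    ancestor = tokenid
    # Bound the number of steps so that roots and cycles terminate.
    for _ in range(len(heads)):
        if heads[ancestor] == head or heads[ancestor] < 0:
            return True
        ancestor = heads[ancestor]
    return False


def is_nonproj_arc(tokenid: int, heads: List[int]) -> bool:
    head = heads[tokenid]
    if head == tokenid or head < 0:  # root arcs and unattached tokens
        return False
    start, end = (head + 1, tokenid) if head < tokenid else (tokenid + 1, head)
    # Non-projective if some token between head and dependent is not
    # dominated by the head.
    return any(not has_head_as_ancestor(k, head, heads) for k in range(start, end))


def smallest_nonproj_arc(heads: List[int]) -> int:
    """Return the dependent of the shortest non-projective arc, or -1."""
    best, best_size = -1, None
    for tokenid, head in enumerate(heads):
        size = abs(tokenid - head)
        if (best_size is None or size < best_size) and is_nonproj_arc(tokenid, heads):
            best, best_size = tokenid, size
    return best


def projectivize_heads(heads: List[Optional[int]]) -> List[int]:
    """Lift the smallest non-projective arc until none is left."""
    proj = heads_to_ints(heads)
    arc = smallest_nonproj_arc(proj)
    while arc != -1:
        head = proj[arc]
        ghead = proj[head]
        # Attach to the grandparent, or to itself if the head is the root.
        proj[arc] = ghead if head != ghead else arc
        arc = smallest_nonproj_arc(proj)
    return proj


heads = [2, 3, 2, 2]  # token 2 is the root; the arc 3 -> 1 crosses it
print(smallest_nonproj_arc(heads), projectivize_heads(heads))
```

Using -1 as the sentinel is what allows the helpers in the diff to take a `vector[int]` and run without the GIL, since no Python `None` has to be represented.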
379
spacy/pipeline/edit_tree_lemmatizer.py
Normal file
|
@ -0,0 +1,379 @@
|
|||
from typing import cast, Any, Callable, Dict, Iterable, List, Optional
|
||||
from typing import Sequence, Tuple, Union
|
||||
from collections import Counter
|
||||
from copy import deepcopy
|
||||
from itertools import islice
|
||||
import numpy as np
|
||||
|
||||
import srsly
|
||||
from thinc.api import Config, Model, SequenceCategoricalCrossentropy
|
||||
from thinc.types import Floats2d, Ints1d, Ints2d
|
||||
|
||||
from ._edit_tree_internals.edit_trees import EditTrees
|
||||
from ._edit_tree_internals.schemas import validate_edit_tree
|
||||
from .lemmatizer import lemmatizer_score
|
||||
from .trainable_pipe import TrainablePipe
|
||||
from ..errors import Errors
|
||||
from ..language import Language
|
||||
from ..tokens import Doc
|
||||
from ..training import Example, validate_examples, validate_get_examples
|
||||
from ..vocab import Vocab
|
||||
from .. import util
|
||||
|
||||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
pretrained_vectors = null
|
||||
width = 96
|
||||
depth = 4
|
||||
embed_size = 2000
|
||||
window_size = 1
|
||||
maxout_pieces = 3
|
||||
subword_features = true
|
||||
"""
|
||||
DEFAULT_EDIT_TREE_LEMMATIZER_MODEL = Config().from_str(default_model_config)["model"]
|
||||
|
||||
|
||||
@Language.factory(
|
||||
"trainable_lemmatizer",
|
||||
assigns=["token.lemma"],
|
||||
requires=[],
|
||||
default_config={
|
||||
"model": DEFAULT_EDIT_TREE_LEMMATIZER_MODEL,
|
||||
"backoff": "orth",
|
||||
"min_tree_freq": 3,
|
||||
"overwrite": False,
|
||||
"top_k": 1,
|
||||
"scorer": {"@scorers": "spacy.lemmatizer_scorer.v1"},
|
||||
},
|
||||
default_score_weights={"lemma_acc": 1.0},
|
||||
)
|
||||
def make_edit_tree_lemmatizer(
|
||||
nlp: Language,
|
||||
name: str,
|
||||
model: Model,
|
||||
backoff: Optional[str],
|
||||
min_tree_freq: int,
|
||||
overwrite: bool,
|
||||
top_k: int,
|
||||
scorer: Optional[Callable],
|
||||
):
|
||||
"""Construct an EditTreeLemmatizer component."""
|
||||
return EditTreeLemmatizer(
|
||||
nlp.vocab,
|
||||
model,
|
||||
name,
|
||||
backoff=backoff,
|
||||
min_tree_freq=min_tree_freq,
|
||||
overwrite=overwrite,
|
||||
top_k=top_k,
|
||||
scorer=scorer,
|
||||
)
|
||||
|
||||
|
||||
class EditTreeLemmatizer(TrainablePipe):
|
||||
"""
|
||||
Lemmatizer that lemmatizes each word using a predicted edit tree.
|
||||
"""
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
model: Model,
|
||||
name: str = "trainable_lemmatizer",
|
||||
*,
|
||||
backoff: Optional[str] = "orth",
|
||||
min_tree_freq: int = 3,
|
||||
overwrite: bool = False,
|
||||
top_k: int = 1,
|
||||
scorer: Optional[Callable] = lemmatizer_score,
|
||||
):
|
||||
"""
|
||||
Construct an edit tree lemmatizer.
|
||||
|
||||
backoff (Optional[str]): backoff to use when the predicted edit trees
|
||||
are not applicable. Must be an attribute of Token or None (leave the
|
||||
lemma unset).
|
||||
min_tree_freq (int): prune trees that are applied less than this
|
||||
frequency in the training data.
|
||||
overwrite (bool): overwrite existing lemma annotations.
|
||||
top_k (int): try to apply at most the k most probable edit trees.
|
||||
"""
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
self.name = name
|
||||
self.backoff = backoff
|
||||
self.min_tree_freq = min_tree_freq
|
||||
self.overwrite = overwrite
|
||||
self.top_k = top_k
|
||||
|
||||
self.trees = EditTrees(self.vocab.strings)
|
||||
self.tree2label: Dict[int, int] = {}
|
||||
|
||||
self.cfg: Dict[str, Any] = {"labels": []}
|
||||
self.scorer = scorer
|
||||
|
||||
def get_loss(
|
||||
self, examples: Iterable[Example], scores: List[Floats2d]
|
||||
) -> Tuple[float, List[Floats2d]]:
|
||||
validate_examples(examples, "EditTreeLemmatizer.get_loss")
|
||||
loss_func = SequenceCategoricalCrossentropy(normalize=False, missing_value=-1)
|
||||
|
||||
truths = []
|
||||
for eg in examples:
|
||||
eg_truths = []
|
||||
for (predicted, gold_lemma) in zip(
|
||||
eg.predicted, eg.get_aligned("LEMMA", as_string=True)
|
||||
):
|
||||
if gold_lemma is None:
|
||||
label = -1
|
||||
else:
|
||||
tree_id = self.trees.add(predicted.text, gold_lemma)
|
||||
label = self.tree2label.get(tree_id, 0)
|
||||
eg_truths.append(label)
|
||||
|
||||
truths.append(eg_truths)
|
||||
|
||||
d_scores, loss = loss_func(scores, truths) # type: ignore
|
||||
if self.model.ops.xp.isnan(loss):
|
||||
raise ValueError(Errors.E910.format(name=self.name))
|
||||
|
||||
return float(loss), d_scores
|
||||
|
||||
def predict(self, docs: Iterable[Doc]) -> List[Ints2d]:
|
||||
n_docs = len(list(docs))
|
||||
if not any(len(doc) for doc in docs):
|
||||
# Handle cases where there are no tokens in any docs.
|
||||
n_labels = len(self.cfg["labels"])
|
||||
guesses: List[Ints2d] = [
|
||||
self.model.ops.alloc((0, n_labels), dtype="i") for doc in docs
|
||||
]
|
||||
assert len(guesses) == n_docs
|
||||
return guesses
|
||||
scores = self.model.predict(docs)
|
||||
assert len(scores) == n_docs
|
||||
guesses = self._scores2guesses(docs, scores)
|
||||
assert len(guesses) == n_docs
|
||||
return guesses
|
||||
|
||||
def _scores2guesses(self, docs, scores):
|
||||
guesses = []
|
||||
for doc, doc_scores in zip(docs, scores):
|
||||
if self.top_k == 1:
|
||||
doc_guesses = doc_scores.argmax(axis=1).reshape(-1, 1)
|
||||
else:
|
||||
doc_guesses = np.argsort(doc_scores)[..., : -self.top_k - 1 : -1]
|
||||
|
||||
if not isinstance(doc_guesses, np.ndarray):
|
||||
doc_guesses = doc_guesses.get()
|
||||
|
||||
doc_compat_guesses = []
|
||||
for token, candidates in zip(doc, doc_guesses):
|
||||
tree_id = -1
|
||||
for candidate in candidates:
|
||||
candidate_tree_id = self.cfg["labels"][candidate]
|
||||
|
||||
if self.trees.apply(candidate_tree_id, token.text) is not None:
|
||||
tree_id = candidate_tree_id
|
||||
break
|
||||
doc_compat_guesses.append(tree_id)
|
||||
|
||||
guesses.append(np.array(doc_compat_guesses))
|
||||
|
||||
return guesses
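The candidate loop in `_scores2guesses` can be illustrated without spaCy or NumPy. The sketch below is a simplified stand-in, not the component's actual code: `top_k_indices`, `first_applicable` and the `is_applicable` predicate are hypothetical names, and plain lists replace the score arrays. It shows the same reversed-slice idiom on an ascending sort and the `-1` fallback when no candidate applies.

```python
from typing import Callable, List, Sequence


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest scores, best first."""
    ascending = sorted(range(len(scores)), key=lambda i: scores[i])
    # Same slicing idiom as in the diff: walk backwards from the end
    # of the ascending order and stop after k entries.
    return ascending[: -k - 1 : -1]


def first_applicable(
    scores: Sequence[float], k: int, is_applicable: Callable[[int], bool]
) -> int:
    """Return the best-scoring applicable label, or -1 if none applies."""
    for label in top_k_indices(scores, k):
        if is_applicable(label):
            return label
    return -1


scores = [0.1, 0.7, 0.2, 0.5]
best_two = top_k_indices(scores, 2)
# Pretend label 1 cannot be applied to the token (hypothetical rule).
fallback = first_applicable(scores, 2, lambda label: label != 1)
no_match = first_applicable(scores, 1, lambda label: label != 1)
```

With `k` larger than the number of labels the slice simply returns every index, so no extra bounds check is needed in this sketch.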
|
||||
|
||||
def set_annotations(self, docs: Iterable[Doc], batch_tree_ids):
|
||||
for i, doc in enumerate(docs):
|
||||
doc_tree_ids = batch_tree_ids[i]
|
||||
if hasattr(doc_tree_ids, "get"):
|
||||
doc_tree_ids = doc_tree_ids.get()
|
||||
for j, tree_id in enumerate(doc_tree_ids):
|
||||
if self.overwrite or doc[j].lemma == 0:
|
||||
# If no applicable tree could be found during prediction,
|
||||
# the special identifier -1 is used. Otherwise the tree
|
||||
# is guaranteed to be applicable.
|
||||
if tree_id == -1:
|
||||
if self.backoff is not None:
|
||||
doc[j].lemma = getattr(doc[j], self.backoff)
|
||||
else:
|
||||
lemma = self.trees.apply(tree_id, doc[j].text)
|
||||
doc[j].lemma_ = lemma
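The decision logic in `set_annotations` (respect existing lemmas unless `overwrite` is set, back off when the tree id is `-1`) may be easier to follow in isolation. This is a minimal sketch under stated assumptions: `ToyToken`, `annotate` and `strip_ed` are hypothetical, and an empty string stands in for spaCy's "lemma not set" value of `0`.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyToken:
    orth: str
    lemma: str = ""  # empty string stands in for "no lemma set"


def annotate(
    token: ToyToken,
    tree_id: int,
    apply_tree: Callable[[int, str], str],
    backoff: Optional[str] = "orth",
    overwrite: bool = False,
) -> None:
    # Leave existing annotations alone unless overwriting is requested.
    if token.lemma and not overwrite:
        return
    if tree_id == -1:
        # No applicable tree was found: copy the backoff attribute, if any.
        if backoff is not None:
            token.lemma = getattr(token, backoff)
    else:
        token.lemma = apply_tree(tree_id, token.orth)


def strip_ed(tree_id: int, form: str) -> str:
    # Hypothetical single "tree": drop a trailing "ed".
    return form[:-2] if form.endswith("ed") else form


predicted = ToyToken("walked")
annotate(predicted, 0, strip_ed)
backed_off = ToyToken("mice")
annotate(backed_off, -1, strip_ed)
kept = ToyToken("went", lemma="go")
annotate(kept, 0, strip_ed)
```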
|
||||
|
||||
@property
|
||||
def labels(self) -> Tuple[int, ...]:
|
||||
"""Returns the labels currently added to the component."""
|
||||
return tuple(self.cfg["labels"])
|
||||
|
||||
@property
|
||||
def hide_labels(self) -> bool:
|
||||
return True
|
||||
|
||||
@property
|
||||
def label_data(self) -> Dict:
|
||||
trees = []
|
||||
for tree_id in range(len(self.trees)):
|
||||
tree = self.trees[tree_id]
|
||||
if "orig" in tree:
|
||||
tree["orig"] = self.vocab.strings[tree["orig"]]
|
||||
if "subst" in tree:
|
||||
tree["subst"] = self.vocab.strings[tree["subst"]]
|
||||
trees.append(tree)
|
||||
return dict(trees=trees, labels=tuple(self.cfg["labels"]))
|
||||
|
||||
def initialize(
|
||||
self,
|
||||
get_examples: Callable[[], Iterable[Example]],
|
||||
*,
|
||||
nlp: Optional[Language] = None,
|
||||
labels: Optional[Dict] = None,
|
||||
):
|
||||
validate_get_examples(get_examples, "EditTreeLemmatizer.initialize")
|
||||
|
||||
if labels is None:
|
||||
self._labels_from_data(get_examples)
|
||||
else:
|
||||
self._add_labels(labels)
|
||||
|
||||
# Sample for the model.
|
||||
doc_sample = []
|
||||
label_sample = []
|
||||
for example in islice(get_examples(), 10):
|
||||
doc_sample.append(example.x)
|
||||
gold_labels: List[List[float]] = []
|
||||
for token in example.reference:
|
||||
if token.lemma == 0:
|
||||
gold_label = None
|
||||
else:
|
||||
gold_label = self._pair2label(token.text, token.lemma_)
|
||||
|
||||
gold_labels.append(
|
||||
[
|
||||
1.0 if label == gold_label else 0.0
|
||||
for label in self.cfg["labels"]
|
||||
]
|
||||
)
|
||||
|
||||
gold_labels = cast(Floats2d, gold_labels)
|
||||
label_sample.append(self.model.ops.asarray(gold_labels, dtype="float32"))
|
||||
|
||||
self._require_labels()
|
||||
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
|
||||
assert len(label_sample) > 0, Errors.E923.format(name=self.name)
|
||||
|
||||
self.model.initialize(X=doc_sample, Y=label_sample)
|
||||
|
||||
def from_bytes(self, bytes_data, *, exclude=tuple()):
|
||||
deserializers = {
|
||||
"cfg": lambda b: self.cfg.update(srsly.json_loads(b)),
|
||||
"model": lambda b: self.model.from_bytes(b),
|
||||
"vocab": lambda b: self.vocab.from_bytes(b, exclude=exclude),
|
||||
"trees": lambda b: self.trees.from_bytes(b),
|
||||
}
|
||||
|
||||
util.from_bytes(bytes_data, deserializers, exclude)
|
||||
|
||||
return self
|
||||
|
||||
def to_bytes(self, *, exclude=tuple()):
|
||||
serializers = {
|
||||
"cfg": lambda: srsly.json_dumps(self.cfg),
|
||||
"model": lambda: self.model.to_bytes(),
|
||||
"vocab": lambda: self.vocab.to_bytes(exclude=exclude),
|
||||
"trees": lambda: self.trees.to_bytes(),
|
||||
}
|
||||
|
||||
return util.to_bytes(serializers, exclude)
|
||||
|
||||
def to_disk(self, path, exclude=tuple()):
|
||||
path = util.ensure_path(path)
|
||||
serializers = {
|
||||
"cfg": lambda p: srsly.write_json(p, self.cfg),
|
||||
"model": lambda p: self.model.to_disk(p),
|
||||
"vocab": lambda p: self.vocab.to_disk(p, exclude=exclude),
|
||||
"trees": lambda p: self.trees.to_disk(p),
|
||||
}
|
||||
util.to_disk(path, serializers, exclude)
|
||||
|
||||
def from_disk(self, path, exclude=tuple()):
|
||||
def load_model(p):
|
||||
try:
|
||||
with open(p, "rb") as mfile:
|
||||
self.model.from_bytes(mfile.read())
|
||||
except AttributeError:
|
||||
raise ValueError(Errors.E149) from None
|
||||
|
||||
deserializers = {
|
||||
"cfg": lambda p: self.cfg.update(srsly.read_json(p)),
|
||||
"model": load_model,
|
||||
"vocab": lambda p: self.vocab.from_disk(p, exclude=exclude),
|
||||
"trees": lambda p: self.trees.from_disk(p),
|
||||
}
|
||||
|
||||
util.from_disk(path, deserializers, exclude)
|
||||
return self
|
||||
|
||||
def _add_labels(self, labels: Dict):
|
||||
if "labels" not in labels:
|
||||
raise ValueError(Errors.E857.format(name="labels"))
|
||||
if "trees" not in labels:
|
||||
raise ValueError(Errors.E857.format(name="trees"))
|
||||
|
||||
self.cfg["labels"] = list(labels["labels"])
|
||||
trees = []
|
||||
for tree in labels["trees"]:
|
||||
errors = validate_edit_tree(tree)
|
||||
if errors:
|
||||
raise ValueError(Errors.E1026.format(errors="\n".join(errors)))
|
||||
|
||||
tree = dict(tree)
|
||||
if "orig" in tree:
|
||||
tree["orig"] = self.vocab.strings[tree["orig"]]
|
||||
if "subst" in tree:
|
||||
tree["subst"] = self.vocab.strings[tree["subst"]]
|
||||
|
||||
trees.append(tree)
|
||||
|
||||
self.trees.from_json(trees)
|
||||
|
||||
for label, tree in enumerate(self.labels):
|
||||
self.tree2label[tree] = label
|
||||
|
||||
def _labels_from_data(self, get_examples: Callable[[], Iterable[Example]]):
|
||||
# Count corpus tree frequencies in ad-hoc storage to avoid cluttering
|
||||
# the final pipe/string store.
|
||||
vocab = Vocab()
|
||||
trees = EditTrees(vocab.strings)
|
||||
tree_freqs: Counter = Counter()
|
||||
repr_pairs: Dict = {}
|
||||
for example in get_examples():
|
||||
for token in example.reference:
|
||||
if token.lemma != 0:
|
||||
tree_id = trees.add(token.text, token.lemma_)
|
||||
tree_freqs[tree_id] += 1
|
||||
repr_pairs[tree_id] = (token.text, token.lemma_)
|
||||
|
||||
# Construct trees that make the frequency cut-off using representative
|
||||
# form - lemma pairs.
|
||||
for tree_id, freq in tree_freqs.items():
|
||||
if freq >= self.min_tree_freq:
|
||||
form, lemma = repr_pairs[tree_id]
|
||||
self._pair2label(form, lemma, add_label=True)
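The frequency cut-off in `_labels_from_data` can be sketched with a toy transformation in place of real edit trees. Everything here is an assumption made for illustration: `suffix_rule` is a hypothetical, much cruder stand-in for `EditTrees.add`, and `labels_from_pairs` only mirrors the counting and the `min_tree_freq` filter.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Rule = Tuple[str, str]


def suffix_rule(form: str, lemma: str) -> Rule:
    """Toy stand-in for an edit tree: (suffix to strip, suffix to add)."""
    i = 0
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return form[i:], lemma[i:]


def labels_from_pairs(
    pairs: Iterable[Tuple[str, str]], min_freq: int
) -> List[Rule]:
    """Keep only the rules seen at least min_freq times."""
    freqs = Counter(suffix_rule(form, lemma) for form, lemma in pairs)
    return [rule for rule, freq in freqs.items() if freq >= min_freq]


pairs = [
    ("walked", "walk"),
    ("talked", "talk"),
    ("jumped", "jump"),
    ("mice", "mouse"),
]
frequent = labels_from_pairs(pairs, min_freq=3)
everything = labels_from_pairs(pairs, min_freq=1)
```

Rare transformations such as the one for "mice" are dropped at the default threshold, which is why a backoff is needed at prediction time.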
|
||||
|
||||
def _pair2label(self, form, lemma, add_label=False):
|
||||
"""
|
||||
Look up the edit tree identifier for a form/lemma pair. If the edit
|
||||
tree is unknown and "add_label" is set, the edit tree will be added to
|
||||
the labels.
|
||||
"""
|
||||
tree_id = self.trees.add(form, lemma)
|
||||
if tree_id not in self.tree2label:
|
||||
if not add_label:
|
||||
return None
|
||||
|
||||
self.tree2label[tree_id] = len(self.cfg["labels"])
|
||||
self.cfg["labels"].append(tree_id)
|
||||
return self.tree2label[tree_id]
|
|
@ -6,17 +6,17 @@ import srsly
|
|||
import random
|
||||
from thinc.api import CosineDistance, Model, Optimizer, Config
|
||||
from thinc.api import set_dropout_rate
|
||||
import warnings
|
||||
|
||||
from ..kb import KnowledgeBase, Candidate
|
||||
from ..ml import empty_kb
|
||||
from ..tokens import Doc, Span
|
||||
from .pipe import deserialize_config
|
||||
from .legacy.entity_linker import EntityLinker_v1
|
||||
from .trainable_pipe import TrainablePipe
|
||||
from ..language import Language
|
||||
from ..vocab import Vocab
|
||||
from ..training import Example, validate_examples, validate_get_examples
|
||||
from ..errors import Errors, Warnings
|
||||
from ..errors import Errors
|
||||
from ..util import SimpleFrozenList, registry
|
||||
from .. import util
|
||||
from ..scorer import Scorer
|
||||
|
@ -26,7 +26,7 @@ BACKWARD_OVERWRITE = True
|
|||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.EntityLinker.v1"
|
||||
@architectures = "spacy.EntityLinker.v2"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
|
@ -55,6 +55,7 @@ DEFAULT_NEL_MODEL = Config().from_str(default_model_config)["model"]
|
|||
"get_candidates": {"@misc": "spacy.CandidateGenerator.v1"},
|
||||
"overwrite": True,
|
||||
"scorer": {"@scorers": "spacy.entity_linker_scorer.v1"},
|
||||
"use_gold_ents": True,
|
||||
},
|
||||
default_score_weights={
|
||||
"nel_micro_f": 1.0,
|
||||
|
@ -75,6 +76,7 @@ def make_entity_linker(
|
|||
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
|
||||
overwrite: bool,
|
||||
scorer: Optional[Callable],
|
||||
use_gold_ents: bool,
|
||||
):
|
||||
"""Construct an EntityLinker component.
|
||||
|
||||
|
@ -90,6 +92,22 @@ def make_entity_linker(
|
|||
produces a list of candidates, given a certain knowledge base and a textual mention.
|
||||
scorer (Optional[Callable]): The scoring method.
|
||||
"""
|
||||
|
||||
if not model.attrs.get("include_span_maker", False):
|
||||
# The only difference in arguments here is that use_gold_ents is not available
|
||||
return EntityLinker_v1(
|
||||
nlp.vocab,
|
||||
model,
|
||||
name,
|
||||
labels_discard=labels_discard,
|
||||
n_sents=n_sents,
|
||||
incl_prior=incl_prior,
|
||||
incl_context=incl_context,
|
||||
entity_vector_length=entity_vector_length,
|
||||
get_candidates=get_candidates,
|
||||
overwrite=overwrite,
|
||||
scorer=scorer,
|
||||
)
|
||||
return EntityLinker(
|
||||
nlp.vocab,
|
||||
model,
|
||||
|
@ -102,6 +120,7 @@ def make_entity_linker(
|
|||
get_candidates=get_candidates,
|
||||
overwrite=overwrite,
|
||||
scorer=scorer,
|
||||
use_gold_ents=use_gold_ents,
|
||||
)
|
||||
|
||||
|
||||
|
@ -136,6 +155,7 @@ class EntityLinker(TrainablePipe):
|
|||
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
|
||||
overwrite: bool = BACKWARD_OVERWRITE,
|
||||
scorer: Optional[Callable] = entity_linker_score,
|
||||
use_gold_ents: bool,
|
||||
) -> None:
|
||||
"""Initialize an entity linker.
|
||||
|
||||
|
@ -152,6 +172,8 @@ class EntityLinker(TrainablePipe):
|
|||
produces a list of candidates, given a certain knowledge base and a textual mention.
|
||||
scorer (Optional[Callable]): The scoring method. Defaults to
|
||||
Scorer.score_links.
|
||||
use_gold_ents (bool): Whether to copy entities from gold docs or not. If false, another
|
||||
component must provide entity annotations.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#init
|
||||
"""
|
||||
|
@ -169,6 +191,7 @@ class EntityLinker(TrainablePipe):
|
|||
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
|
||||
self.kb = empty_kb(entity_vector_length)(self.vocab)
|
||||
self.scorer = scorer
|
||||
self.use_gold_ents = use_gold_ents
|
||||
|
||||
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
|
||||
"""Define the KB of this pipe by providing a function that will
|
||||
|
@ -212,14 +235,48 @@ class EntityLinker(TrainablePipe):
|
|||
doc_sample = []
|
||||
vector_sample = []
|
||||
for example in islice(get_examples(), 10):
|
||||
doc_sample.append(example.x)
|
||||
doc = example.x
|
||||
if self.use_gold_ents:
|
||||
doc.ents = example.y.ents
|
||||
doc_sample.append(doc)
|
||||
vector_sample.append(self.model.ops.alloc1f(nO))
|
||||
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
|
||||
assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
|
||||
|
||||
# XXX In order for size estimation to work, there has to be at least
|
||||
# one entity. It's not used for training so it doesn't have to be real,
|
||||
# so we add a fake one if none are present.
|
||||
# We can't use Doc.has_annotation here because it can be True for docs
|
||||
# that have been through an NER component but got no entities.
|
||||
has_annotations = any([doc.ents for doc in doc_sample])
|
||||
if not has_annotations:
|
||||
doc = doc_sample[0]
|
||||
ent = doc[0:1]
|
||||
ent.label_ = "XXX"
|
||||
doc.ents = (ent,)
|
||||
|
||||
self.model.initialize(
|
||||
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
|
||||
)
|
||||
|
||||
if not has_annotations:
|
||||
# Clean up dummy annotation
|
||||
doc.ents = []
|
||||
|
||||
def batch_has_learnable_example(self, examples):
|
||||
"""Check if a batch contains a learnable example.
|
||||
|
||||
If one isn't present, then the update step needs to be skipped.
|
||||
"""
|
||||
|
||||
for eg in examples:
|
||||
for ent in eg.predicted.ents:
|
||||
candidates = list(self.get_candidates(self.kb, ent))
|
||||
if candidates:
|
||||
return True
|
||||
|
||||
return False
|
||||
|
||||
def update(
|
||||
self,
|
||||
examples: Iterable[Example],
|
||||
|
@ -247,35 +304,29 @@ class EntityLinker(TrainablePipe):
|
|||
if not examples:
|
||||
return losses
|
||||
validate_examples(examples, "EntityLinker.update")
|
||||
sentence_docs = []
|
||||
for eg in examples:
|
||||
sentences = [s for s in eg.reference.sents]
|
||||
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
|
||||
for ent in eg.reference.ents:
|
||||
# KB ID of the first token is the same as the whole span
|
||||
kb_id = kb_ids[ent.start]
|
||||
if kb_id:
|
||||
try:
|
||||
# find the sentence in the list of sentences.
|
||||
sent_index = sentences.index(ent.sent)
|
||||
except AttributeError:
|
||||
# Catch the exception when ent.sent is None and provide a user-friendly warning
|
||||
raise RuntimeError(Errors.E030) from None
|
||||
# get n previous sentences, if there are any
|
||||
start_sentence = max(0, sent_index - self.n_sents)
|
||||
# get n posterior sentences, or as many < n as there are
|
||||
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
|
||||
# get token positions
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
# append that span as a doc to training
|
||||
sent_doc = eg.predicted[start_token:end_token].as_doc()
|
||||
sentence_docs.append(sent_doc)
|
||||
|
||||
set_dropout_rate(self.model, drop)
|
||||
if not sentence_docs:
|
||||
warnings.warn(Warnings.W093.format(name="Entity Linker"))
|
||||
docs = [eg.predicted for eg in examples]
|
||||
# save to restore later
|
||||
old_ents = [doc.ents for doc in docs]
|
||||
|
||||
for doc, ex in zip(docs, examples):
|
||||
if self.use_gold_ents:
|
||||
doc.ents = ex.reference.ents
|
||||
else:
|
||||
# only keep matching ents
|
||||
doc.ents = ex.get_matching_ents()
|
||||
|
||||
# make sure we have something to learn from, if not, short-circuit
|
||||
if not self.batch_has_learnable_example(examples):
|
||||
return losses
|
||||
sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
|
||||
|
||||
sentence_encodings, bp_context = self.model.begin_update(docs)
|
||||
|
||||
# now restore the ents
|
||||
for doc, old in zip(docs, old_ents):
|
||||
doc.ents = old
|
||||
|
||||
loss, d_scores = self.get_loss(
|
||||
sentence_encodings=sentence_encodings, examples=examples
|
||||
)
|
||||
|
@ -288,24 +339,38 @@ class EntityLinker(TrainablePipe):
|
|||
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
|
||||
validate_examples(examples, "EntityLinker.get_loss")
|
||||
entity_encodings = []
|
||||
eidx = 0 # indices in gold entities to keep
|
||||
keep_ents = [] # indices in sentence_encodings to keep
|
||||
|
||||
for eg in examples:
|
||||
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
|
||||
|
||||
for ent in eg.reference.ents:
|
||||
kb_id = kb_ids[ent.start]
|
||||
if kb_id:
|
||||
entity_encoding = self.kb.get_vector(kb_id)
|
||||
entity_encodings.append(entity_encoding)
|
||||
keep_ents.append(eidx)
|
||||
|
||||
eidx += 1
|
||||
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
|
||||
if sentence_encodings.shape != entity_encodings.shape:
|
||||
selected_encodings = sentence_encodings[keep_ents]
|
||||
|
||||
# If the entity encodings list is empty, the shapes below cannot match and an error is raised
|
||||
if selected_encodings.shape != entity_encodings.shape:
|
||||
err = Errors.E147.format(
|
||||
method="get_loss", msg="gold entities do not match up"
|
||||
)
|
||||
raise RuntimeError(err)
|
||||
# TODO: fix typing issue here
|
||||
gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore
|
||||
loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
|
||||
gradients = self.distance.get_grad(selected_encodings, entity_encodings) # type: ignore
|
||||
# to match the input size, we need to give a zero gradient for items not in the kb
|
||||
out = self.model.ops.alloc2f(*sentence_encodings.shape)
|
||||
out[keep_ents] = gradients
|
||||
|
||||
loss = self.distance.get_loss(selected_encodings, entity_encodings) # type: ignore
|
||||
loss = loss / len(entity_encodings)
|
||||
return float(loss), gradients
|
||||
return float(loss), out
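The new `get_loss` computes gradients only for entities that have a KB vector and then pads the result back to the shape of `sentence_encodings`. A minimal sketch of that scatter step, using plain lists rather than Thinc arrays; `scatter_rows` is a hypothetical helper, not part of spaCy:

```python
from typing import List, Sequence


def scatter_rows(
    n_rows: int,
    width: int,
    keep: Sequence[int],
    rows: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Place rows at the kept indices; every other row stays zero."""
    out = [[0.0] * width for _ in range(n_rows)]
    for index, row in zip(keep, rows):
        out[index] = list(row)
    return out


# Three entities in the batch, but only the first and third are in the KB.
keep_ents = [0, 2]
gradients = [[0.5, -0.5], [1.0, 2.0]]
full = scatter_rows(3, 2, keep_ents, gradients)
```

The zero row means the entity without a KB entry contributes nothing to the backward pass, while the output still lines up with the model's input.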
|
||||
|
||||
def predict(self, docs: Iterable[Doc]) -> List[str]:
|
||||
"""Apply the pipeline's model to a batch of docs, without modifying them.
|
||||
|
|
3
spacy/pipeline/legacy/__init__.py
Normal file
|
@ -0,0 +1,3 @@
|
|||
from .entity_linker import EntityLinker_v1
|
||||
|
||||
__all__ = ["EntityLinker_v1"]
|
427
spacy/pipeline/legacy/entity_linker.py
Normal file
|
@ -0,0 +1,427 @@
|
|||
# This file is present to provide a prior version of the EntityLinker component
|
||||
# for backwards compatibility. For details see #9669.
|
||||
|
||||
from typing import Optional, Iterable, Callable, Dict, Union, List, Any
|
||||
from thinc.types import Floats2d
|
||||
from pathlib import Path
|
||||
from itertools import islice
|
||||
import srsly
|
||||
import random
|
||||
from thinc.api import CosineDistance, Model, Optimizer, Config
|
||||
from thinc.api import set_dropout_rate
|
||||
import warnings
|
||||
|
||||
from ...kb import KnowledgeBase, Candidate
|
||||
from ...ml import empty_kb
|
||||
from ...tokens import Doc, Span
|
||||
from ..pipe import deserialize_config
|
||||
from ..trainable_pipe import TrainablePipe
|
||||
from ...language import Language
|
||||
from ...vocab import Vocab
|
||||
from ...training import Example, validate_examples, validate_get_examples
|
||||
from ...errors import Errors, Warnings
|
||||
from ...util import SimpleFrozenList, registry
|
||||
from ... import util
|
||||
from ...scorer import Scorer
|
||||
|
||||
# See #9050
|
||||
BACKWARD_OVERWRITE = True
|
||||
|
||||
|
||||
def entity_linker_score(examples, **kwargs):
|
||||
return Scorer.score_links(examples, negative_labels=[EntityLinker_v1.NIL], **kwargs)
|
||||
|
||||
|
||||
class EntityLinker_v1(TrainablePipe):
|
||||
"""Pipeline component for named entity linking.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker
|
||||
"""
|
||||
|
||||
NIL = "NIL" # string used to refer to a non-existing link
|
||||
|
||||
def __init__(
|
||||
self,
|
||||
vocab: Vocab,
|
||||
model: Model,
|
||||
name: str = "entity_linker",
|
||||
*,
|
||||
labels_discard: Iterable[str],
|
||||
n_sents: int,
|
||||
incl_prior: bool,
|
||||
incl_context: bool,
|
||||
entity_vector_length: int,
|
||||
get_candidates: Callable[[KnowledgeBase, Span], Iterable[Candidate]],
|
||||
overwrite: bool = BACKWARD_OVERWRITE,
|
||||
scorer: Optional[Callable] = entity_linker_score,
|
||||
) -> None:
|
||||
"""Initialize an entity linker.
|
||||
|
||||
vocab (Vocab): The shared vocabulary.
|
||||
model (thinc.api.Model): The Thinc Model powering the pipeline component.
|
||||
name (str): The component instance name, used to add entries to the
|
||||
losses during training.
|
||||
labels_discard (Iterable[str]): NER labels that will automatically get a "NIL" prediction.
|
||||
n_sents (int): The number of neighbouring sentences to take into account.
|
||||
incl_prior (bool): Whether or not to include prior probabilities from the KB in the model.
|
||||
incl_context (bool): Whether or not to include the local context in the model.
|
||||
entity_vector_length (int): Size of encoding vectors in the KB.
|
||||
get_candidates (Callable[[KnowledgeBase, Span], Iterable[Candidate]]): Function that
|
||||
produces a list of candidates, given a certain knowledge base and a textual mention.
|
||||
scorer (Optional[Callable]): The scoring method. Defaults to
|
||||
Scorer.score_links.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#init
|
||||
"""
|
||||
self.vocab = vocab
|
||||
self.model = model
|
||||
self.name = name
|
||||
self.labels_discard = list(labels_discard)
|
||||
self.n_sents = n_sents
|
||||
self.incl_prior = incl_prior
|
||||
self.incl_context = incl_context
|
||||
self.get_candidates = get_candidates
|
||||
self.cfg: Dict[str, Any] = {"overwrite": overwrite}
|
||||
self.distance = CosineDistance(normalize=False)
|
||||
# how many neighbour sentences to take into account
|
||||
# create an empty KB by default. If you want to load a predefined one, specify it in 'initialize'.
|
||||
self.kb = empty_kb(entity_vector_length)(self.vocab)
|
||||
self.scorer = scorer
|
||||
|
||||
def set_kb(self, kb_loader: Callable[[Vocab], KnowledgeBase]):
|
||||
"""Define the KB of this pipe by providing a function that will
|
||||
create it using this object's vocab."""
|
||||
if not callable(kb_loader):
|
||||
raise ValueError(Errors.E885.format(arg_type=type(kb_loader)))
|
||||
|
||||
self.kb = kb_loader(self.vocab)
|
||||
|
||||
def validate_kb(self) -> None:
|
||||
# Raise an error if the knowledge base is not initialized.
|
||||
if self.kb is None:
|
||||
raise ValueError(Errors.E1018.format(name=self.name))
|
||||
if len(self.kb) == 0:
|
||||
raise ValueError(Errors.E139.format(name=self.name))
|
||||
|
||||
def initialize(
|
||||
self,
|
||||
get_examples: Callable[[], Iterable[Example]],
|
||||
*,
|
||||
nlp: Optional[Language] = None,
|
||||
kb_loader: Optional[Callable[[Vocab], KnowledgeBase]] = None,
|
||||
):
|
||||
"""Initialize the pipe for training, using a representative set
|
||||
of data examples.
|
||||
|
||||
get_examples (Callable[[], Iterable[Example]]): Function that
|
||||
returns a representative sample of gold-standard Example objects.
|
||||
nlp (Language): The current nlp object the component is part of.
|
||||
kb_loader (Callable[[Vocab], KnowledgeBase]): A function that creates a KnowledgeBase from a Vocab instance.
|
||||
Note that providing this argument will overwrite all data accumulated in the current KB.
|
||||
Use this only when loading a KB as-such from file.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#initialize
|
||||
"""
|
||||
validate_get_examples(get_examples, "EntityLinker_v1.initialize")
|
||||
if kb_loader is not None:
|
||||
self.set_kb(kb_loader)
|
||||
self.validate_kb()
|
||||
nO = self.kb.entity_vector_length
|
||||
doc_sample = []
|
||||
vector_sample = []
|
||||
for example in islice(get_examples(), 10):
|
||||
doc_sample.append(example.x)
|
||||
vector_sample.append(self.model.ops.alloc1f(nO))
|
||||
assert len(doc_sample) > 0, Errors.E923.format(name=self.name)
|
||||
assert len(vector_sample) > 0, Errors.E923.format(name=self.name)
|
||||
self.model.initialize(
|
||||
X=doc_sample, Y=self.model.ops.asarray(vector_sample, dtype="float32")
|
||||
)
|
||||
|
||||
def update(
|
||||
self,
|
||||
examples: Iterable[Example],
|
||||
*,
|
||||
drop: float = 0.0,
|
||||
sgd: Optional[Optimizer] = None,
|
||||
losses: Optional[Dict[str, float]] = None,
|
||||
) -> Dict[str, float]:
|
||||
"""Learn from a batch of documents and gold-standard information,
|
||||
updating the pipe's model. Delegates to predict and get_loss.
|
||||
|
||||
examples (Iterable[Example]): A batch of Example objects.
|
||||
drop (float): The dropout rate.
|
||||
sgd (thinc.api.Optimizer): The optimizer.
|
||||
losses (Dict[str, float]): Optional record of the loss during training.
|
||||
Updated using the component name as the key.
|
||||
RETURNS (Dict[str, float]): The updated losses dictionary.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#update
|
||||
"""
|
||||
self.validate_kb()
|
||||
if losses is None:
|
||||
losses = {}
|
||||
losses.setdefault(self.name, 0.0)
|
||||
if not examples:
|
||||
return losses
|
||||
validate_examples(examples, "EntityLinker_v1.update")
|
||||
sentence_docs = []
|
||||
for eg in examples:
|
||||
sentences = [s for s in eg.reference.sents]
|
||||
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
|
||||
for ent in eg.reference.ents:
|
||||
# KB ID of the first token is the same as the whole span
|
||||
kb_id = kb_ids[ent.start]
|
||||
if kb_id:
|
||||
try:
|
||||
# find the sentence in the list of sentences.
|
||||
sent_index = sentences.index(ent.sent)
|
||||
except AttributeError:
|
||||
# Catch the exception when ent.sent is None and provide a user-friendly warning
|
||||
raise RuntimeError(Errors.E030) from None
|
||||
# get n previous sentences, if there are any
|
||||
start_sentence = max(0, sent_index - self.n_sents)
|
||||
# get n posterior sentences, or as many < n as there are
|
||||
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
|
||||
# get token positions
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
# append that span as a doc to training
|
||||
sent_doc = eg.predicted[start_token:end_token].as_doc()
|
||||
sentence_docs.append(sent_doc)
|
||||
set_dropout_rate(self.model, drop)
|
||||
if not sentence_docs:
|
||||
warnings.warn(Warnings.W093.format(name="Entity Linker"))
|
||||
return losses
|
||||
sentence_encodings, bp_context = self.model.begin_update(sentence_docs)
|
||||
loss, d_scores = self.get_loss(
|
||||
sentence_encodings=sentence_encodings, examples=examples
|
||||
)
|
||||
bp_context(d_scores)
|
||||
if sgd is not None:
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += loss
|
||||
return losses
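The sentence window built in the legacy `update` (and again in `predict`) reduces to two clipped indices. The sketch below assumes sentences are given as `(start, end)` token offsets; `context_window` is a hypothetical helper that mirrors the `max`/`min` clipping in the loop above.

```python
from typing import List, Tuple


def context_window(
    sent_bounds: List[Tuple[int, int]], sent_index: int, n_sents: int
) -> Tuple[int, int]:
    """Return (start_token, end_token) covering n_sents on each side."""
    # Up to n_sents previous sentences, if there are any.
    start_sentence = max(0, sent_index - n_sents)
    # Up to n_sents following sentences, clipped to the last sentence.
    end_sentence = min(len(sent_bounds) - 1, sent_index + n_sents)
    return sent_bounds[start_sentence][0], sent_bounds[end_sentence][1]


# Four sentences with end-exclusive token offsets.
bounds = [(0, 5), (5, 9), (9, 14), (14, 20)]
middle = context_window(bounds, 1, 1)
clipped = context_window(bounds, 0, 2)
single = context_window(bounds, 3, 0)
```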
|
||||
|
||||
def get_loss(self, examples: Iterable[Example], sentence_encodings: Floats2d):
|
||||
validate_examples(examples, "EntityLinker_v1.get_loss")
|
||||
entity_encodings = []
|
||||
for eg in examples:
|
||||
kb_ids = eg.get_aligned("ENT_KB_ID", as_string=True)
|
||||
for ent in eg.reference.ents:
|
||||
kb_id = kb_ids[ent.start]
|
||||
if kb_id:
|
||||
entity_encoding = self.kb.get_vector(kb_id)
|
||||
entity_encodings.append(entity_encoding)
|
||||
entity_encodings = self.model.ops.asarray(entity_encodings, dtype="float32")
|
||||
if sentence_encodings.shape != entity_encodings.shape:
|
||||
err = Errors.E147.format(
|
||||
method="get_loss", msg="gold entities do not match up"
|
||||
)
|
||||
raise RuntimeError(err)
|
||||
# TODO: fix typing issue here
|
||||
gradients = self.distance.get_grad(sentence_encodings, entity_encodings) # type: ignore
|
||||
loss = self.distance.get_loss(sentence_encodings, entity_encodings) # type: ignore
|
||||
loss = loss / len(entity_encodings)
|
||||
return float(loss), gradients
|
||||
|
||||
def predict(self, docs: Iterable[Doc]) -> List[str]:
|
||||
"""Apply the pipeline's model to a batch of docs, without modifying them.
|
||||
Returns the KB IDs for each entity in each doc, including NIL if there is
|
||||
no prediction.
|
||||
|
||||
docs (Iterable[Doc]): The documents to predict.
|
||||
RETURNS (List[str]): The model's prediction for each document.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#predict
|
||||
"""
|
||||
self.validate_kb()
|
||||
entity_count = 0
|
||||
final_kb_ids: List[str] = []
|
||||
if not docs:
|
||||
return final_kb_ids
|
||||
if isinstance(docs, Doc):
|
||||
docs = [docs]
|
||||
for i, doc in enumerate(docs):
|
||||
sentences = [s for s in doc.sents]
|
||||
if len(doc) > 0:
|
||||
# Looping through each entity (TODO: rewrite)
|
||||
for ent in doc.ents:
|
||||
sent = ent.sent
|
||||
sent_index = sentences.index(sent)
|
||||
assert sent_index >= 0
|
||||
# get n_neighbour sentences, clipped to the length of the document
|
||||
start_sentence = max(0, sent_index - self.n_sents)
|
||||
end_sentence = min(len(sentences) - 1, sent_index + self.n_sents)
|
||||
start_token = sentences[start_sentence].start
|
||||
end_token = sentences[end_sentence].end
|
||||
sent_doc = doc[start_token:end_token].as_doc()
|
||||
# currently, the context is the same for each entity in a sentence (should be refined)
|
||||
xp = self.model.ops.xp
|
||||
if self.incl_context:
|
||||
sentence_encoding = self.model.predict([sent_doc])[0]
|
||||
sentence_encoding_t = sentence_encoding.T
|
||||
sentence_norm = xp.linalg.norm(sentence_encoding_t)
|
||||
entity_count += 1
|
||||
if ent.label_ in self.labels_discard:
|
||||
# ignoring this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
else:
|
||||
candidates = list(self.get_candidates(self.kb, ent))
|
||||
if not candidates:
|
||||
# no prediction possible for this entity - setting to NIL
|
||||
final_kb_ids.append(self.NIL)
|
||||
elif len(candidates) == 1:
|
||||
# shortcut for efficiency reasons: take the 1 candidate
|
||||
# TODO: thresholding
|
||||
final_kb_ids.append(candidates[0].entity_)
|
||||
else:
|
||||
random.shuffle(candidates)
|
||||
# set all prior probabilities to 0 if incl_prior=False
|
||||
prior_probs = xp.asarray([c.prior_prob for c in candidates])
|
||||
if not self.incl_prior:
|
||||
prior_probs = xp.asarray([0.0 for _ in candidates])
|
||||
scores = prior_probs
|
||||
# add in similarity from the context
|
||||
if self.incl_context:
|
||||
entity_encodings = xp.asarray(
|
||||
[c.entity_vector for c in candidates]
|
||||
)
|
||||
entity_norm = xp.linalg.norm(entity_encodings, axis=1)
|
||||
if len(entity_encodings) != len(prior_probs):
|
||||
raise RuntimeError(
|
||||
Errors.E147.format(
|
||||
method="predict",
|
||||
msg="vectors not of equal length",
|
||||
)
|
||||
)
|
||||
# cosine similarity
|
||||
sims = xp.dot(entity_encodings, sentence_encoding_t) / (
|
||||
sentence_norm * entity_norm
|
||||
)
|
||||
if sims.shape != prior_probs.shape:
|
||||
raise ValueError(Errors.E161)
|
||||
scores = prior_probs + sims - (prior_probs * sims)
|
||||
# TODO: thresholding
|
||||
best_index = scores.argmax().item()
|
||||
best_candidate = candidates[best_index]
|
||||
final_kb_ids.append(best_candidate.entity_)
|
||||
if not (len(final_kb_ids) == entity_count):
|
||||
err = Errors.E147.format(
|
||||
method="predict", msg="result variables not of equal length"
|
||||
)
|
||||
raise RuntimeError(err)
|
||||
return final_kb_ids
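The context-window clipping and candidate scoring in `predict` above can be sketched without spaCy or thinc. This is a minimal illustration, not the library's implementation: `context_window`, `cosine`, `combine`, and `best_candidate` are hypothetical helper names, and candidates are assumed to be plain `(kb_id, prior_prob, entity_vector)` tuples rather than `Candidate` objects. It also skips the `random.shuffle` step, so ties go to the first candidate.

```python
import math
from typing import List, Sequence, Tuple


def context_window(n_sentences: int, sent_index: int, n_sents: int) -> Tuple[int, int]:
    """Return (start, end) sentence indices, clipped to the document bounds."""
    start = max(0, sent_index - n_sents)
    end = min(n_sentences - 1, sent_index + n_sents)
    return start, end


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def combine(prior: float, sim: float) -> float:
    """Combine prior and similarity as in the diff: p + s - p * s."""
    return prior + sim - prior * sim


def best_candidate(
    context_vec: Sequence[float],
    candidates: List[Tuple[str, float, Sequence[float]]],
    incl_prior: bool = True,
    incl_context: bool = True,
) -> str:
    """Pick the KB ID with the highest combined score."""
    scores = []
    for _kb_id, prior, vec in candidates:
        # Priors are zeroed out when incl_prior is False.
        score = prior if incl_prior else 0.0
        if incl_context:
            score = combine(score, cosine(vec, context_vec))
        scores.append(score)
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index][0]


candidates = [("Q1", 0.2, [1.0, 0.0]), ("Q2", 0.7, [0.0, 1.0])]
print(best_candidate([1.0, 0.0], candidates))                      # context favours Q1
print(best_candidate([1.0, 0.0], candidates, incl_context=False))  # prior favours Q2
```

When both inputs lie in [0, 1], `p + s - p * s` behaves like a probabilistic OR: either a strong prior or a strong context match is enough to push the score up.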
|
||||
|
||||
def set_annotations(self, docs: Iterable[Doc], kb_ids: List[str]) -> None:
|
||||
"""Modify a batch of documents, using pre-computed scores.
|
||||
|
||||
docs (Iterable[Doc]): The documents to modify.
|
||||
kb_ids (List[str]): The IDs to set, produced by EntityLinker.predict.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#set_annotations
|
||||
"""
|
||||
count_ents = len([ent for doc in docs for ent in doc.ents])
|
||||
if count_ents != len(kb_ids):
|
||||
raise ValueError(Errors.E148.format(ents=count_ents, ids=len(kb_ids)))
|
||||
i = 0
|
||||
overwrite = self.cfg["overwrite"]
|
||||
for doc in docs:
|
||||
for ent in doc.ents:
|
||||
kb_id = kb_ids[i]
|
||||
i += 1
|
||||
for token in ent:
|
||||
if token.ent_kb_id == 0 or overwrite:
|
||||
token.ent_kb_id_ = kb_id
|
||||
|
||||
def to_bytes(self, *, exclude=tuple()):
|
||||
"""Serialize the pipe to a bytestring.
|
||||
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
RETURNS (bytes): The serialized object.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#to_bytes
|
||||
"""
|
||||
self._validate_serialization_attrs()
|
||||
serialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
serialize["cfg"] = lambda: srsly.json_dumps(self.cfg)
|
||||
serialize["vocab"] = lambda: self.vocab.to_bytes(exclude=exclude)
|
||||
serialize["kb"] = self.kb.to_bytes
|
||||
serialize["model"] = self.model.to_bytes
|
||||
return util.to_bytes(serialize, exclude)
|
||||
|
||||
def from_bytes(self, bytes_data, *, exclude=tuple()):
|
||||
"""Load the pipe from a bytestring.
|
||||
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
RETURNS (TrainablePipe): The loaded object.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#from_bytes
|
||||
"""
|
||||
self._validate_serialization_attrs()
|
||||
|
||||
def load_model(b):
|
||||
try:
|
||||
self.model.from_bytes(b)
|
||||
except AttributeError:
|
||||
raise ValueError(Errors.E149) from None
|
||||
|
||||
deserialize = {}
|
||||
if hasattr(self, "cfg") and self.cfg is not None:
|
||||
deserialize["cfg"] = lambda b: self.cfg.update(srsly.json_loads(b))
|
||||
deserialize["vocab"] = lambda b: self.vocab.from_bytes(b, exclude=exclude)
|
||||
deserialize["kb"] = lambda b: self.kb.from_bytes(b)
|
||||
deserialize["model"] = load_model
|
||||
util.from_bytes(bytes_data, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def to_disk(
|
||||
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
|
||||
) -> None:
|
||||
"""Serialize the pipe to disk.
|
||||
|
||||
path (str / Path): Path to a directory.
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#to_disk
|
||||
"""
|
||||
serialize = {}
|
||||
serialize["vocab"] = lambda p: self.vocab.to_disk(p, exclude=exclude)
|
||||
serialize["cfg"] = lambda p: srsly.write_json(p, self.cfg)
|
||||
serialize["kb"] = lambda p: self.kb.to_disk(p)
|
||||
serialize["model"] = lambda p: self.model.to_disk(p)
|
||||
util.to_disk(path, serialize, exclude)
|
||||
|
||||
def from_disk(
|
||||
self, path: Union[str, Path], *, exclude: Iterable[str] = SimpleFrozenList()
|
||||
) -> "EntityLinker_v1":
|
||||
"""Load the pipe from disk. Modifies the object in place and returns it.
|
||||
|
||||
path (str / Path): Path to a directory.
|
||||
exclude (Iterable[str]): String names of serialization fields to exclude.
|
||||
RETURNS (EntityLinker): The modified EntityLinker object.
|
||||
|
||||
DOCS: https://spacy.io/api/entitylinker#from_disk
|
||||
"""
|
||||
|
||||
def load_model(p):
|
||||
try:
|
||||
with p.open("rb") as infile:
|
||||
self.model.from_bytes(infile.read())
|
||||
except AttributeError:
|
||||
raise ValueError(Errors.E149) from None
|
||||
|
||||
deserialize: Dict[str, Callable[[Any], Any]] = {}
|
||||
deserialize["cfg"] = lambda p: self.cfg.update(deserialize_config(p))
|
||||
deserialize["vocab"] = lambda p: self.vocab.from_disk(p, exclude=exclude)
|
||||
deserialize["kb"] = lambda p: self.kb.from_disk(p)
|
||||
deserialize["model"] = load_model
|
||||
util.from_disk(path, deserialize, exclude)
|
||||
return self
|
||||
|
||||
def rehearse(self, examples, *, sgd=None, losses=None, **config):
|
||||
raise NotImplementedError
|
||||
|
||||
def add_label(self, label):
|
||||
raise NotImplementedError
|
|
@ -25,7 +25,7 @@ BACKWARD_EXTEND = False
|
|||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.Tok2Vec.v2"
|
||||
|
|
|
@ -20,7 +20,7 @@ BACKWARD_OVERWRITE = False
|
|||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
|
|
|
@ -272,6 +272,24 @@ class SpanCategorizer(TrainablePipe):
|
|||
scores = self.model.predict((docs, indices)) # type: ignore
|
||||
return indices, scores
|
||||
|
||||
def set_candidates(
|
||||
self, docs: Iterable[Doc], *, candidates_key: str = "candidates"
|
||||
) -> None:
|
||||
"""Use the spancat suggester to add a list of span candidates to a list of docs.
|
||||
This method is intended to be used for debugging purposes.
|
||||
|
||||
docs (Iterable[Doc]): The documents to modify.
|
||||
candidates_key (str): Key of the Doc.spans dict to save the candidate spans under.
|
||||
|
||||
DOCS: https://spacy.io/api/spancategorizer#set_candidates
|
||||
"""
|
||||
suggester_output = self.suggester(docs, ops=self.model.ops)
|
||||
|
||||
for candidates, doc in zip(suggester_output, docs): # type: ignore
|
||||
doc.spans[candidates_key] = []
|
||||
for index in candidates.dataXd:
|
||||
doc.spans[candidates_key].append(doc[index[0] : index[1]])
|
||||
|
||||
def set_annotations(self, docs: Iterable[Doc], indices_scores) -> None:
|
||||
"""Modify a batch of Doc objects, using pre-computed scores.
|
||||
|
||||
|
@ -378,7 +396,7 @@ class SpanCategorizer(TrainablePipe):
|
|||
# If the prediction is 0.9 and it's false, the gradient will be
|
||||
# 0.9 (0.9 - 0.0)
|
||||
d_scores = scores - target
|
||||
loss = float((d_scores ** 2).sum())
|
||||
loss = float((d_scores**2).sum())
|
||||
return loss, d_scores
|
||||
|
||||
def initialize(
|
||||
|
|
|
@ -27,7 +27,7 @@ BACKWARD_OVERWRITE = False
|
|||
|
||||
default_model_config = """
|
||||
[model]
|
||||
@architectures = "spacy.Tagger.v1"
|
||||
@architectures = "spacy.Tagger.v2"
|
||||
|
||||
[model.tok2vec]
|
||||
@architectures = "spacy.HashEmbedCNN.v2"
|
||||
|
@ -225,6 +225,7 @@ class Tagger(TrainablePipe):
|
|||
|
||||
DOCS: https://spacy.io/api/tagger#rehearse
|
||||
"""
|
||||
loss_func = SequenceCategoricalCrossentropy()
|
||||
if losses is None:
|
||||
losses = {}
|
||||
losses.setdefault(self.name, 0.0)
|
||||
|
@ -236,12 +237,12 @@ class Tagger(TrainablePipe):
|
|||
# Handle cases where there are no tokens in any docs.
|
||||
return losses
|
||||
set_dropout_rate(self.model, drop)
|
||||
guesses, backprop = self.model.begin_update(docs)
|
||||
target = self._rehearsal_model(examples)
|
||||
gradient = guesses - target
|
||||
backprop(gradient)
|
||||
tag_scores, bp_tag_scores = self.model.begin_update(docs)
|
||||
tutor_tag_scores, _ = self._rehearsal_model.begin_update(docs)
|
||||
grads, loss = loss_func(tag_scores, tutor_tag_scores)
|
||||
bp_tag_scores(grads)
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += (gradient**2).sum()
|
||||
losses[self.name] += loss
|
||||
return losses
|
||||
|
||||
def get_loss(self, examples, scores):
|
||||
|
|
|
@ -283,12 +283,12 @@ class TextCategorizer(TrainablePipe):
|
|||
return losses
|
||||
set_dropout_rate(self.model, drop)
|
||||
scores, bp_scores = self.model.begin_update(docs)
|
||||
target = self._rehearsal_model(examples)
|
||||
target, _ = self._rehearsal_model.begin_update(docs)
|
||||
gradient = scores - target
|
||||
bp_scores(gradient)
|
||||
if sgd is not None:
|
||||
self.finish_update(sgd)
|
||||
losses[self.name] += (gradient ** 2).sum()
|
||||
losses[self.name] += (gradient**2).sum()
|
||||
return losses
|
||||
|
||||
def _examples_to_truth(
|
||||
|
@ -320,9 +320,9 @@ class TextCategorizer(TrainablePipe):
|
|||
self._validate_categories(examples)
|
||||
truths, not_missing = self._examples_to_truth(examples)
|
||||
not_missing = self.model.ops.asarray(not_missing) # type: ignore
|
||||
d_scores = (scores - truths)
|
||||
d_scores = scores - truths
|
||||
d_scores *= not_missing
|
||||
mean_square_error = (d_scores ** 2).mean()
|
||||
mean_square_error = (d_scores**2).mean()
|
||||
return float(mean_square_error), d_scores
|
||||
|
||||
def add_label(self, label: str) -> int:
|
||||
|
|
|
@ -118,6 +118,10 @@ class Tok2Vec(TrainablePipe):
|
|||
|
||||
DOCS: https://spacy.io/api/tok2vec#predict
|
||||
"""
|
||||
if not any(len(doc) for doc in docs):
|
||||
# Handle cases where there are no tokens in any docs.
|
||||
width = self.model.get_dim("nO")
|
||||
return [self.model.ops.alloc((0, width)) for doc in docs]
|
||||
tokvecs = self.model.predict(docs)
|
||||
batch_id = Tok2VecListener.get_batch_id(docs)
|
||||
for listener in self.listeners:
|
||||
|
|
|
@ -228,7 +228,7 @@ class Scorer:
|
|||
if token.orth_.isspace():
|
||||
continue
|
||||
if align.x2y.lengths[token.i] == 1:
|
||||
gold_i = align.x2y[token.i].dataXd[0, 0]
|
||||
gold_i = align.x2y[token.i][0]
|
||||
if gold_i not in missing_indices:
|
||||
pred_tags.add((gold_i, getter(token, attr)))
|
||||
tag_score.score_set(pred_tags, gold_tags)
|
||||
|
@ -287,7 +287,7 @@ class Scorer:
|
|||
if token.orth_.isspace():
|
||||
continue
|
||||
if align.x2y.lengths[token.i] == 1:
|
||||
gold_i = align.x2y[token.i].dataXd[0, 0]
|
||||
gold_i = align.x2y[token.i][0]
|
||||
if gold_i not in missing_indices:
|
||||
value = getter(token, attr)
|
||||
morph = gold_doc.vocab.strings[value]
|
||||
|
@ -694,13 +694,13 @@ class Scorer:
|
|||
if align.x2y.lengths[token.i] != 1:
|
||||
gold_i = None # type: ignore
|
||||
else:
|
||||
gold_i = align.x2y[token.i].dataXd[0, 0]
|
||||
gold_i = align.x2y[token.i][0]
|
||||
if gold_i not in missing_indices:
|
||||
dep = getter(token, attr)
|
||||
head = head_getter(token, head_attr)
|
||||
if dep not in ignore_labels and token.orth_.strip():
|
||||
if align.x2y.lengths[head.i] == 1:
|
||||
gold_head = align.x2y[head.i].dataXd[0, 0]
|
||||
gold_head = align.x2y[head.i][0]
|
||||
else:
|
||||
gold_head = None
|
||||
# None is indistinct, so we can't just add it to the set
|
||||
|
@ -750,7 +750,7 @@ def get_ner_prf(examples: Iterable[Example], **kwargs) -> Dict[str, Any]:
|
|||
for pred_ent in eg.x.ents:
|
||||
if pred_ent.label_ not in score_per_type:
|
||||
score_per_type[pred_ent.label_] = PRFScore()
|
||||
indices = align_x2y[pred_ent.start : pred_ent.end].dataXd.ravel()
|
||||
indices = align_x2y[pred_ent.start : pred_ent.end]
|
||||
if len(indices):
|
||||
g_span = eg.y[indices[0] : indices[-1] + 1]
|
||||
# Check we aren't missing annotation on this span. If so,
|
||||
|
|
|
@ -1,4 +1,4 @@
|
|||
from typing import Optional, Iterable, Iterator, Union, Any
|
||||
from typing import Optional, Iterable, Iterator, Union, Any, overload
|
||||
from pathlib import Path
|
||||
|
||||
def get_string_id(key: Union[str, int]) -> int: ...
|
||||
|
@ -7,7 +7,10 @@ class StringStore:
|
|||
def __init__(
|
||||
self, strings: Optional[Iterable[str]] = ..., freeze: bool = ...
|
||||
) -> None: ...
|
||||
def __getitem__(self, string_or_id: Union[bytes, str, int]) -> Union[str, int]: ...
|
||||
@overload
|
||||
def __getitem__(self, string_or_id: Union[bytes, str]) -> int: ...
|
||||
@overload
|
||||
def __getitem__(self, string_or_id: int) -> str: ...
|
||||
def as_int(self, key: Union[bytes, str, int]) -> int: ...
|
||||
def as_string(self, key: Union[bytes, str, int]) -> str: ...
|
||||
def add(self, string: str) -> int: ...
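The `@overload` pair added to `__getitem__` above lets a type checker infer `int` for a string lookup and `str` for an integer lookup, instead of the wider `Union[str, int]`. A minimal sketch of the same typing pattern follows. `TinyStringStore` is a hypothetical toy class: it uses list positions as IDs and ignores `bytes` keys, whereas the real `StringStore` maps strings to hash values.

```python
from typing import Dict, List, Union, overload


class TinyStringStore:
    """Toy two-way string/ID table (illustrative only)."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._strings: List[str] = []

    def add(self, string: str) -> int:
        # Assign the next free index to unseen strings.
        if string not in self._ids:
            self._ids[string] = len(self._strings)
            self._strings.append(string)
        return self._ids[string]

    @overload
    def __getitem__(self, string_or_id: str) -> int: ...

    @overload
    def __getitem__(self, string_or_id: int) -> str: ...

    def __getitem__(self, string_or_id: Union[str, int]) -> Union[str, int]:
        # One runtime implementation serves both declared overloads.
        if isinstance(string_or_id, int):
            return self._strings[string_or_id]
        return self._ids[string_or_id]


store = TinyStringStore()
apple_id = store.add("apple")
print(store["apple"] == apple_id, store[apple_id])
```

With the overloads in place, a checker such as mypy can treat `store["apple"]` as `int` and `store[apple_id]` as `str` without casts at the call site.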
|
||||
|
|
|
@ -99,6 +99,11 @@ def de_vocab():
|
|||
return get_lang_class("de")().vocab
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def dsb_tokenizer():
|
||||
return get_lang_class("dsb")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def el_tokenizer():
|
||||
return get_lang_class("el")().tokenizer
|
||||
|
@ -221,12 +226,30 @@ def ja_tokenizer():
|
|||
return get_lang_class("ja")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def hsb_tokenizer():
|
||||
return get_lang_class("hsb")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ko_tokenizer():
|
||||
pytest.importorskip("natto")
|
||||
return get_lang_class("ko")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ko_tokenizer_tokenizer():
|
||||
config = {
|
||||
"nlp": {
|
||||
"tokenizer": {
|
||||
"@tokenizers": "spacy.Tokenizer.v1",
|
||||
}
|
||||
}
|
||||
}
|
||||
nlp = get_lang_class("ko").from_config(config)
|
||||
return nlp.tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def lb_tokenizer():
|
||||
return get_lang_class("lb")().tokenizer
|
||||
|
@ -334,6 +357,11 @@ def sv_tokenizer():
|
|||
return get_lang_class("sv")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def ta_tokenizer():
|
||||
return get_lang_class("ta")().tokenizer
|
||||
|
||||
|
||||
@pytest.fixture(scope="session")
|
||||
def th_tokenizer():
|
||||
pytest.importorskip("pythainlp")
|
||||
|
|
|
@ -1,6 +1,7 @@
|
|||
import weakref
|
||||
|
||||
import numpy
|
||||
from numpy.testing import assert_array_equal
|
||||
import pytest
|
||||
from thinc.api import NumpyOps, get_current_ops
|
||||
|
||||
|
@ -634,6 +635,14 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
assert "group" in m_doc.spans
|
||||
assert span_group_texts == sorted([s.text for s in m_doc.spans["group"]])
|
||||
|
||||
# can exclude spans
|
||||
m_doc = Doc.from_docs(en_docs, exclude=["spans"])
|
||||
assert "group" not in m_doc.spans
|
||||
|
||||
# can exclude user_data
|
||||
m_doc = Doc.from_docs(en_docs, exclude=["user_data"])
|
||||
assert m_doc.user_data == {}
|
||||
|
||||
# can merge empty docs
|
||||
doc = Doc.from_docs([en_tokenizer("")] * 10)
|
||||
|
||||
|
@ -647,6 +656,20 @@ def test_doc_api_from_docs(en_tokenizer, de_tokenizer):
|
|||
assert "group" in m_doc.spans
|
||||
assert len(m_doc.spans["group"]) == 0
|
||||
|
||||
# with tensor
|
||||
ops = get_current_ops()
|
||||
for doc in en_docs:
|
||||
doc.tensor = ops.asarray([[len(t.text), 0.0] for t in doc])
|
||||
m_doc = Doc.from_docs(en_docs)
|
||||
assert_array_equal(
|
||||
ops.to_numpy(m_doc.tensor),
|
||||
ops.to_numpy(ops.xp.vstack([doc.tensor for doc in en_docs if len(doc)])),
|
||||
)
|
||||
|
||||
# can exclude tensor
|
||||
m_doc = Doc.from_docs(en_docs, exclude=["tensor"])
|
||||
assert m_doc.tensor.shape == (0,)
|
||||
|
||||
|
||||
def test_doc_api_from_docs_ents(en_tokenizer):
|
||||
texts = ["Merging the docs is fun.", "They don't think alike."]
|
||||
|
@ -684,6 +707,7 @@ def test_has_annotation(en_vocab):
|
|||
attrs = ("TAG", "POS", "MORPH", "LEMMA", "DEP", "HEAD", "ENT_IOB", "ENT_TYPE")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[0].tag_ = "A"
|
||||
doc[0].pos_ = "X"
|
||||
|
@ -709,6 +733,27 @@ def test_has_annotation(en_vocab):
|
|||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_has_annotation_sents(en_vocab):
|
||||
doc = Doc(en_vocab, words=["Hello", "beautiful", "world"])
|
||||
attrs = ("SENT_START", "IS_SENT_START", "IS_SENT_END")
|
||||
for attr in attrs:
|
||||
assert not doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
# The first token (index 0) is always assumed to be a sentence start,
|
||||
# and ignored by the check in doc.has_annotation
|
||||
|
||||
doc[1].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert not doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
doc[2].is_sent_start = False
|
||||
for attr in attrs:
|
||||
assert doc.has_annotation(attr)
|
||||
assert doc.has_annotation(attr, require_complete=True)
|
||||
|
||||
|
||||
def test_is_flags_deprecated(en_tokenizer):
|
||||
doc = en_tokenizer("test")
|
||||
with pytest.deprecated_call():
|
||||
|
|
|
@ -655,3 +655,16 @@ def test_span_sents(doc, start, end, expected_sentences, expected_sentences_with
|
|||
def test_span_sents_not_parsed(doc_not_parsed):
|
||||
with pytest.raises(ValueError):
|
||||
list(Span(doc_not_parsed, 0, 3).sents)
|
||||
|
||||
|
||||
def test_span_group_copy(doc):
|
||||
doc.spans["test"] = [doc[0:1], doc[2:4]]
|
||||
assert len(doc.spans["test"]) == 2
|
||||
doc_copy = doc.copy()
|
||||
# check that the spans were indeed copied
|
||||
assert len(doc_copy.spans["test"]) == 2
|
||||
# add a new span to the original doc
|
||||
doc.spans["test"].append(doc[3:4])
|
||||
assert len(doc.spans["test"]) == 3
|
||||
# check that the copy spans were not modified and this is an isolated doc
|
||||
assert len(doc_copy.spans["test"]) == 2
|
||||
|
|
242
spacy/tests/doc/test_span_group.py
Normal file
|
@ -0,0 +1,242 @@
|
|||
import pytest
|
||||
from random import Random
|
||||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Span, SpanGroup
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def doc(en_tokenizer):
|
||||
doc = en_tokenizer("0 1 2 3 4 5 6")
|
||||
matcher = Matcher(en_tokenizer.vocab, validate=True)
|
||||
|
||||
# fmt: off
|
||||
matcher.add("4", [[{}, {}, {}, {}]])
|
||||
matcher.add("2", [[{}, {}, ]])
|
||||
matcher.add("1", [[{}, ]])
|
||||
# fmt: on
|
||||
matches = matcher(doc)
|
||||
spans = []
|
||||
for match in matches:
|
||||
spans.append(
|
||||
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
|
||||
)
|
||||
Random(42).shuffle(spans)
|
||||
doc.spans["SPANS"] = SpanGroup(
|
||||
doc, name="SPANS", attrs={"key": "value"}, spans=spans
|
||||
)
|
||||
return doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def other_doc(en_tokenizer):
|
||||
doc = en_tokenizer("0 1 2 3 4 5 6")
|
||||
matcher = Matcher(en_tokenizer.vocab, validate=True)
|
||||
|
||||
# fmt: off
|
||||
matcher.add("4", [[{}, {}, {}, {}]])
|
||||
matcher.add("2", [[{}, {}, ]])
|
||||
matcher.add("1", [[{}, ]])
|
||||
# fmt: on
|
||||
|
||||
matches = matcher(doc)
|
||||
spans = []
|
||||
for match in matches:
|
||||
spans.append(
|
||||
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
|
||||
)
|
||||
Random(42).shuffle(spans)
|
||||
doc.spans["SPANS"] = SpanGroup(
|
||||
doc, name="SPANS", attrs={"key": "value"}, spans=spans
|
||||
)
|
||||
return doc
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def span_group(en_tokenizer):
|
||||
doc = en_tokenizer("0 1 2 3 4 5 6")
|
||||
matcher = Matcher(en_tokenizer.vocab, validate=True)
|
||||
|
||||
# fmt: off
|
||||
matcher.add("4", [[{}, {}, {}, {}]])
|
||||
matcher.add("2", [[{}, {}, ]])
|
||||
matcher.add("1", [[{}, ]])
|
||||
# fmt: on
|
||||
|
||||
matches = matcher(doc)
|
||||
spans = []
|
||||
for match in matches:
|
||||
spans.append(
|
||||
Span(doc, match[1], match[2], en_tokenizer.vocab.strings[match[0]])
|
||||
)
|
||||
Random(42).shuffle(spans)
|
||||
doc.spans["SPANS"] = SpanGroup(
|
||||
doc, name="SPANS", attrs={"key": "value"}, spans=spans
|
||||
)
|
||||
|
||||
|
||||
def test_span_group_copy(doc):
|
||||
span_group = doc.spans["SPANS"]
|
||||
clone = span_group.copy()
|
||||
assert clone != span_group
|
||||
assert clone.name == span_group.name
|
||||
assert clone.attrs == span_group.attrs
|
||||
assert len(clone) == len(span_group)
|
||||
assert list(span_group) == list(clone)
|
||||
clone.name = "new_name"
|
||||
clone.attrs["key"] = "new_value"
|
||||
clone.append(Span(doc, 0, 6, "LABEL"))
|
||||
assert clone.name != span_group.name
|
||||
assert clone.attrs != span_group.attrs
|
||||
assert span_group.attrs["key"] == "value"
|
||||
assert list(span_group) != list(clone)
|
||||
|
||||
|
||||
def test_span_group_set_item(doc, other_doc):
|
||||
span_group = doc.spans["SPANS"]
|
||||
|
||||
index = 5
|
||||
span = span_group[index]
|
||||
span.label_ = "NEW LABEL"
|
||||
span.kb_id = doc.vocab.strings["KB_ID"]
|
||||
|
||||
assert span_group[index].label != span.label
|
||||
assert span_group[index].kb_id != span.kb_id
|
||||
|
||||
span_group[index] = span
|
||||
assert span_group[index].start == span.start
|
||||
assert span_group[index].end == span.end
|
||||
assert span_group[index].label == span.label
|
||||
assert span_group[index].kb_id == span.kb_id
|
||||
assert span_group[index] == span
|
||||
|
||||
with pytest.raises(IndexError):
|
||||
span_group[-100] = span
|
||||
with pytest.raises(IndexError):
|
||||
span_group[100] = span
|
||||
|
||||
span = Span(other_doc, 0, 2)
|
||||
with pytest.raises(ValueError):
|
||||
span_group[index] = span
|
||||
|
||||
|
||||
def test_span_group_has_overlap(doc):
|
||||
span_group = doc.spans["SPANS"]
|
||||
assert span_group.has_overlap
|
||||
|
||||
|
||||
def test_span_group_concat(doc, other_doc):
|
||||
span_group_1 = doc.spans["SPANS"]
|
||||
spans = [doc[0:5], doc[0:6]]
|
||||
span_group_2 = SpanGroup(
|
||||
doc,
|
||||
name="MORE_SPANS",
|
||||
attrs={"key": "new_value", "new_key": "new_value"},
|
||||
spans=spans,
|
||||
)
|
||||
span_group_3 = span_group_1._concat(span_group_2)
|
||||
assert span_group_3.name == span_group_1.name
|
||||
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
|
||||
span_list_expected = list(span_group_1) + list(span_group_2)
|
||||
assert list(span_group_3) == list(span_list_expected)
|
||||
|
||||
# Inplace
|
||||
span_list_expected = list(span_group_1) + list(span_group_2)
|
||||
span_group_3 = span_group_1._concat(span_group_2, inplace=True)
|
||||
assert span_group_3 == span_group_1
|
||||
assert span_group_3.name == span_group_1.name
|
||||
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
|
||||
assert list(span_group_3) == list(span_list_expected)
|
||||
|
||||
span_group_2 = other_doc.spans["SPANS"]
|
||||
with pytest.raises(ValueError):
|
||||
span_group_1._concat(span_group_2)
|
||||
|
||||
|
||||
def test_span_doc_delitem(doc):
|
||||
span_group = doc.spans["SPANS"]
|
||||
length = len(span_group)
|
||||
index = 5
|
||||
span = span_group[index]
|
||||
next_span = span_group[index + 1]
|
||||
del span_group[index]
|
||||
assert len(span_group) == length - 1
|
||||
assert span_group[index] != span
|
||||
assert span_group[index] == next_span
|
||||
|
||||
with pytest.raises(IndexError):
|
||||
del span_group[-100]
|
||||
with pytest.raises(IndexError):
|
||||
del span_group[100]
|
||||
|
||||
|
||||
def test_span_group_add(doc):
|
||||
span_group_1 = doc.spans["SPANS"]
|
||||
spans = [doc[0:5], doc[0:6]]
|
||||
span_group_2 = SpanGroup(
|
||||
doc,
|
||||
name="MORE_SPANS",
|
||||
attrs={"key": "new_value", "new_key": "new_value"},
|
||||
spans=spans,
|
||||
)
|
||||
|
||||
span_group_3_expected = span_group_1._concat(span_group_2)
|
||||
|
||||
span_group_3 = span_group_1 + span_group_2
|
||||
assert len(span_group_3) == len(span_group_3_expected)
|
||||
assert span_group_3.attrs == {"key": "value", "new_key": "new_value"}
|
||||
assert list(span_group_3) == list(span_group_3_expected)
|
||||
|
||||
|
||||
def test_span_group_iadd(doc):
|
||||
span_group_1 = doc.spans["SPANS"].copy()
|
||||
spans = [doc[0:5], doc[0:6]]
|
||||
span_group_2 = SpanGroup(
|
||||
doc,
|
||||
name="MORE_SPANS",
|
||||
attrs={"key": "new_value", "new_key": "new_value"},
|
||||
spans=spans,
|
||||
)
|
||||
|
||||
span_group_1_expected = span_group_1._concat(span_group_2)
|
||||
|
||||
span_group_1 += span_group_2
|
||||
assert len(span_group_1) == len(span_group_1_expected)
|
||||
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
|
||||
assert list(span_group_1) == list(span_group_1_expected)
|
||||
|
||||
span_group_1 = doc.spans["SPANS"].copy()
|
||||
span_group_1 += spans
|
||||
assert len(span_group_1) == len(span_group_1_expected)
|
||||
assert span_group_1.attrs == {
|
||||
"key": "value",
|
||||
}
|
||||
assert list(span_group_1) == list(span_group_1_expected)
|
||||
|
||||
|
||||
def test_span_group_extend(doc):
|
||||
span_group_1 = doc.spans["SPANS"].copy()
|
||||
spans = [doc[0:5], doc[0:6]]
|
||||
span_group_2 = SpanGroup(
|
||||
doc,
|
||||
name="MORE_SPANS",
|
||||
attrs={"key": "new_value", "new_key": "new_value"},
|
||||
spans=spans,
|
||||
)
|
||||
|
||||
span_group_1_expected = span_group_1._concat(span_group_2)
|
||||
|
||||
span_group_1.extend(span_group_2)
|
||||
assert len(span_group_1) == len(span_group_1_expected)
|
||||
assert span_group_1.attrs == {"key": "value", "new_key": "new_value"}
|
||||
assert list(span_group_1) == list(span_group_1_expected)
|
||||
|
||||
span_group_1 = doc.spans["SPANS"]
|
||||
span_group_1.extend(spans)
|
||||
assert len(span_group_1) == len(span_group_1_expected)
|
||||
assert span_group_1.attrs == {"key": "value"}
|
||||
assert list(span_group_1) == list(span_group_1_expected)
|
||||
|
||||
|
||||
def test_span_group_dealloc(span_group):
|
||||
with pytest.raises(AttributeError):
|
||||
print(span_group.doc)
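The concatenation behaviour that the tests above assert (name taken from the left group, attributes merged with the left group's values winning, spans appended in order, `ValueError` across different docs) can be modelled with plain Python objects. This is only a sketch of the semantics implied by the assertions; `ToyGroup` and `concat` are hypothetical names and do not reflect how `SpanGroup._concat` is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToyGroup:
    """Stand-in for a span group: spans are (start, end) token offsets."""

    doc_id: int
    name: str
    attrs: Dict[str, Any] = field(default_factory=dict)
    spans: List[Tuple[int, int]] = field(default_factory=list)


def concat(left: ToyGroup, right: ToyGroup, inplace: bool = False) -> ToyGroup:
    """Concatenate two groups that belong to the same document."""
    if left.doc_id != right.doc_id:
        raise ValueError("Cannot concatenate span groups from different docs")
    # Keys from the right group are added; on conflict the left value is kept.
    merged_attrs = {**right.attrs, **left.attrs}
    merged_spans = left.spans + right.spans
    if inplace:
        left.attrs = merged_attrs
        left.spans = merged_spans
        return left
    return ToyGroup(left.doc_id, left.name, merged_attrs, merged_spans)


first = ToyGroup(1, "SPANS", {"key": "value"}, [(0, 4)])
second = ToyGroup(1, "MORE_SPANS", {"key": "new_value", "new_key": "new_value"}, [(0, 5), (0, 6)])
merged = concat(first, second)
print(merged.name, merged.attrs, merged.spans)
```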
|
|
@ -1,5 +1,5 @@
|
|||
import pytest
|
||||
from spacy.tokens import Doc
|
||||
from spacy.tokens import Doc, Span
|
||||
|
||||
|
||||
@pytest.fixture()
|
||||
|
@ -60,3 +60,13 @@ def test_doc_to_json_underscore_error_serialize(doc):
|
|||
Doc.set_extension("json_test4", method=lambda doc: doc.text)
|
||||
with pytest.raises(ValueError):
|
||||
doc.to_json(underscore=["json_test4"])
|
||||
|
||||
|
||||
def test_doc_to_json_span(doc):
|
||||
"""Test that Doc.to_json() includes spans"""
|
||||
doc.spans["test"] = [Span(doc, 0, 2, "test"), Span(doc, 0, 1, "test")]
|
||||
json_doc = doc.to_json()
|
||||
assert "spans" in json_doc
|
||||
assert len(json_doc["spans"]) == 1
|
||||
assert len(json_doc["spans"]["test"]) == 2
|
||||
assert json_doc["spans"]["test"][0]["start"] == 0
|
||||
|
|
0
spacy/tests/lang/dsb/__init__.py
Normal file
25
spacy/tests/lang/dsb/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,match",
|
||||
[
|
||||
("10", True),
|
||||
("1", True),
|
||||
("10,000", True),
|
||||
("10,00", True),
|
||||
("jadno", True),
|
||||
("dwanassćo", True),
|
||||
("milion", True),
|
||||
("sto", True),
|
||||
("ceła", False),
|
||||
("kopica", False),
|
||||
("narěcow", False),
|
||||
(",", False),
|
||||
("1/2", True),
|
||||
],
|
||||
)
|
||||
def test_lex_attrs_like_number(dsb_tokenizer, text, match):
|
||||
tokens = dsb_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
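The parametrized cases above pin down what `like_num` should accept: plain digits, digits with separators, simple fractions, and number words. A minimal sketch of a function that satisfies them follows. It is modelled on the usual shape of spaCy's per-language `like_num` helpers but is not the Lower Sorbian implementation from this commit, and `_NUM_WORDS` is a deliberately tiny stand-in containing only the words used in the test.

```python
# Assumption: a very small stand-in for the language's number-word list.
_NUM_WORDS = {"jadno", "dwanassćo", "milion", "sto"}


def like_num(text: str) -> bool:
    """Return True if the token text looks like a number."""
    # Drop a single leading sign or approximation marker.
    if text.startswith(("+", "-", "±", "~")):
        text = text[1:]
    # Ignore thousands and decimal separators.
    text = text.replace(",", "").replace(".", "")
    if text.isdigit():
        return True
    # Accept simple fractions such as "1/2".
    if text.count("/") == 1:
        num, denom = text.split("/")
        if num.isdigit() and denom.isdigit():
            return True
    return text.lower() in _NUM_WORDS


for token in ["10,000", "1/2", "sto", "kopica", ","]:
    print(token, like_num(token))
```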
|
29
spacy/tests/lang/dsb/test_tokenizer.py
Normal file
|
@ -0,0 +1,29 @@
|
|||
import pytest
|
||||
|
||||
DSB_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Ale eksistěrujo mimo togo ceła kopica narěcow, ako na pśikład slěpjańska.",
|
||||
[
|
||||
"Ale",
|
||||
"eksistěrujo",
|
||||
"mimo",
|
||||
"togo",
|
||||
"ceła",
|
||||
"kopica",
|
||||
"narěcow",
|
||||
",",
|
||||
"ako",
|
||||
"na",
|
||||
"pśikład",
|
||||
"slěpjańska",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", DSB_BASIC_TOKENIZATION_TESTS)
|
||||
def test_dsb_tokenizer_basic(dsb_tokenizer, text, expected_tokens):
|
||||
tokens = dsb_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -107,7 +107,17 @@ FI_NP_TEST_EXAMPLES = [
|
|||
(
|
||||
"New York tunnetaan kaupunkina, joka ei koskaan nuku",
|
||||
["PROPN", "PROPN", "VERB", "NOUN", "PUNCT", "PRON", "AUX", "ADV", "VERB"],
|
||||
["obj", "flat:name", "ROOT", "obl", "punct", "nsubj", "aux", "advmod", "acl:relcl"],
|
||||
[
|
||||
"obj",
|
||||
"flat:name",
|
||||
"ROOT",
|
||||
"obl",
|
||||
"punct",
|
||||
"nsubj",
|
||||
"aux",
|
||||
"advmod",
|
||||
"acl:relcl",
|
||||
],
|
||||
[2, -1, 0, -1, 4, 3, 2, 1, -5],
|
||||
["New York", "kaupunkina"],
|
||||
),
|
||||
|
@ -130,7 +140,12 @@ FI_NP_TEST_EXAMPLES = [
|
|||
["NOUN", "VERB", "NOUN", "NOUN", "ADJ", "NOUN"],
|
||||
["nsubj", "ROOT", "obj", "obl", "amod", "obl"],
|
||||
[1, 0, -1, -1, 1, -3],
|
||||
["sairaanhoitopiirit", "leikkaustoimintaa", "alueellaan", "useammassa sairaalassa"],
|
||||
[
|
||||
"sairaanhoitopiirit",
|
||||
"leikkaustoimintaa",
|
||||
"alueellaan",
|
||||
"useammassa sairaalassa",
|
||||
],
|
||||
),
|
||||
(
|
||||
"Lain mukaan varhaiskasvatus on suunnitelmallista toimintaa",
|
||||
|
|
0
spacy/tests/lang/hsb/__init__.py
Normal file
25
spacy/tests/lang/hsb/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import pytest
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,match",
|
||||
[
|
||||
("10", True),
|
||||
("1", True),
|
||||
("10,000", True),
|
||||
("10,00", True),
|
||||
("jedne", True),
|
||||
("dwanaće", True),
|
||||
("milion", True),
|
||||
("sto", True),
|
||||
("załožene", False),
|
||||
("wona", False),
|
||||
("powšitkownej", False),
|
||||
(",", False),
|
||||
("1/2", True),
|
||||
],
|
||||
)
|
||||
def test_lex_attrs_like_number(hsb_tokenizer, text, match):
|
||||
tokens = hsb_tokenizer(text)
|
||||
assert len(tokens) == 1
|
||||
assert tokens[0].like_num == match
|
32
spacy/tests/lang/hsb/test_tokenizer.py
Normal file
|
@ -0,0 +1,32 @@
|
|||
import pytest
|
||||
|
||||
HSB_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"Hornjoserbšćina wobsteji resp. wobsteješe z wjacorych dialektow, kotrež so zdźěla chětro wot so rozeznawachu.",
|
||||
[
|
||||
"Hornjoserbšćina",
|
||||
"wobsteji",
|
||||
"resp.",
|
||||
"wobsteješe",
|
||||
"z",
|
||||
"wjacorych",
|
||||
"dialektow",
|
||||
",",
|
||||
"kotrež",
|
||||
"so",
|
||||
"zdźěla",
|
||||
"chětro",
|
||||
"wot",
|
||||
"so",
|
||||
"rozeznawachu",
|
||||
".",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", HSB_BASIC_TOKENIZATION_TESTS)
|
||||
def test_hsb_tokenizer_basic(hsb_tokenizer, text, expected_tokens):
|
||||
tokens = hsb_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
assert expected_tokens == token_list
|
|
@ -47,3 +47,29 @@ def test_ko_tokenizer_pos(ko_tokenizer, text, expected_pos):
|
|||
def test_ko_empty_doc(ko_tokenizer):
|
||||
tokens = ko_tokenizer("")
|
||||
assert len(tokens) == 0
|
||||
|
||||
|
||||
@pytest.mark.issue(10535)
|
||||
def test_ko_tokenizer_unknown_tag(ko_tokenizer):
|
||||
tokens = ko_tokenizer("미닛 리피터")
|
||||
assert tokens[1].pos_ == "X"
|
||||
|
||||
|
||||
# fmt: off
|
||||
SPACY_TOKENIZER_TESTS = [
|
||||
("있다.", "있다 ."),
|
||||
("'예'는", "' 예 ' 는"),
|
||||
("부 (富) 는", "부 ( 富 ) 는"),
|
||||
("부(富)는", "부 ( 富 ) 는"),
|
||||
("1982~1983.", "1982 ~ 1983 ."),
|
||||
("사과·배·복숭아·수박은 모두 과일이다.", "사과 · 배 · 복숭아 · 수박은 모두 과일이다 ."),
|
||||
("그렇구나~", "그렇구나~"),
|
||||
("『9시 반의 당구』,", "『 9시 반의 당구 』 ,"),
|
||||
]
|
||||
# fmt: on
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", SPACY_TOKENIZER_TESTS)
|
||||
def test_ko_spacy_tokenizer(ko_tokenizer_tokenizer, text, expected_tokens):
|
||||
tokens = [token.text for token in ko_tokenizer_tokenizer(text)]
|
||||
assert tokens == expected_tokens.split()
|
||||
|
|
0
spacy/tests/lang/ta/__init__.py
Normal file
25
spacy/tests/lang/ta/test_text.py
Normal file
|
@ -0,0 +1,25 @@
|
|||
import pytest
|
||||
from spacy.lang.ta import Tamil
|
||||
|
||||
# Wikipedia excerpt: https://en.wikipedia.org/wiki/Chennai (Tamil Language)
|
||||
TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT = """சென்னை (Chennai) தமிழ்நாட்டின் தலைநகரமும், இந்தியாவின் நான்காவது பெரிய நகரமும் ஆகும். 1996 ஆம் ஆண்டுக்கு முன்னர் இந்நகரம், மதராசு பட்டினம், மெட்ராஸ் (Madras) மற்றும் சென்னப்பட்டினம் என்றும் அழைக்கப்பட்டு வந்தது. சென்னை, வங்காள விரிகுடாவின் கரையில் அமைந்த துறைமுக நகரங்களுள் ஒன்று. சுமார் 10 மில்லியன் (ஒரு கோடி) மக்கள் வாழும் இந்நகரம், உலகின் 35 பெரிய மாநகரங்களுள் ஒன்று. 17ஆம் நூற்றாண்டில் ஆங்கிலேயர் சென்னையில் கால் பதித்தது முதல், சென்னை நகரம் ஒரு முக்கிய நகரமாக வளர்ந்து வந்திருக்கிறது. சென்னை தென்னிந்தியாவின் வாசலாகக் கருதப்படுகிறது. சென்னை நகரில் உள்ள மெரினா கடற்கரை உலகின் நீளமான கடற்கரைகளுள் ஒன்று. சென்னை கோலிவுட் (Kollywood) என அறியப்படும் தமிழ்த் திரைப்படத் துறையின் தாயகம் ஆகும். பல விளையாட்டு அரங்கங்கள் உள்ள சென்னையில் பல விளையாட்டுப் போட்டிகளும் நடைபெறுகின்றன."""
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text, num_tokens",
|
||||
[(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 23 + 90)], # Punctuation + rest
|
||||
)
|
||||
def test_long_text(ta_tokenizer, text, num_tokens):
|
||||
tokens = ta_tokenizer(text)
|
||||
assert len(tokens) == num_tokens
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text, num_sents", [(TAMIL_BASIC_TOKENIZER_SENTENCIZER_TEST_TEXT, 9)]
|
||||
)
|
||||
def test_ta_sentencizer(text, num_sents):
|
||||
nlp = Tamil()
|
||||
nlp.add_pipe("sentencizer")
|
||||
|
||||
doc = nlp(text)
|
||||
assert len(list(doc.sents)) == num_sents
|
188
spacy/tests/lang/ta/test_tokenizer.py
Normal file
|
@ -0,0 +1,188 @@
|
|||
import pytest
|
||||
from spacy.symbols import ORTH
|
||||
from spacy.lang.ta import Tamil
|
||||
|
||||
TA_BASIC_TOKENIZATION_TESTS = [
|
||||
(
|
||||
"கிறிஸ்துமஸ் மற்றும் இனிய புத்தாண்டு வாழ்த்துக்கள்",
|
||||
["கிறிஸ்துமஸ்", "மற்றும்", "இனிய", "புத்தாண்டு", "வாழ்த்துக்கள்"],
|
||||
),
|
||||
(
|
||||
"எனக்கு என் குழந்தைப் பருவம் நினைவிருக்கிறது",
|
||||
["எனக்கு", "என்", "குழந்தைப்", "பருவம்", "நினைவிருக்கிறது"],
|
||||
),
|
||||
("உங்கள் பெயர் என்ன?", ["உங்கள்", "பெயர்", "என்ன", "?"]),
|
||||
(
|
||||
"ஏறத்தாழ இலங்கைத் தமிழரில் மூன்றிலொரு பங்கினர் இலங்கையை விட்டு வெளியேறிப் பிற நாடுகளில் வாழ்கின்றனர்",
|
||||
[
|
||||
"ஏறத்தாழ",
|
||||
"இலங்கைத்",
|
||||
"தமிழரில்",
|
||||
"மூன்றிலொரு",
|
||||
"பங்கினர்",
|
||||
"இலங்கையை",
|
||||
"விட்டு",
|
||||
"வெளியேறிப்",
|
||||
"பிற",
|
||||
"நாடுகளில்",
|
||||
"வாழ்கின்றனர்",
|
||||
],
|
||||
),
|
||||
(
|
||||
"இந்த ஃபோனுடன் சுமார் ரூ.2,990 மதிப்புள்ள போட் ராக்கர்ஸ் நிறுவனத்தின் ஸ்போர்ட் புளூடூத் ஹெட்போன்ஸ் இலவசமாக வழங்கப்படவுள்ளது.",
|
||||
[
|
||||
"இந்த",
|
||||
"ஃபோனுடன்",
|
||||
"சுமார்",
|
||||
"ரூ.2,990",
|
||||
"மதிப்புள்ள",
|
||||
"போட்",
|
||||
"ராக்கர்ஸ்",
|
||||
"நிறுவனத்தின்",
|
||||
"ஸ்போர்ட்",
|
||||
"புளூடூத்",
|
||||
"ஹெட்போன்ஸ்",
|
||||
"இலவசமாக",
|
||||
"வழங்கப்படவுள்ளது",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"மட்டக்களப்பில் பல இடங்களில் வீட்டுத் திட்டங்களுக்கு இன்று அடிக்கல் நாட்டல்",
|
||||
[
|
||||
"மட்டக்களப்பில்",
|
||||
"பல",
|
||||
"இடங்களில்",
|
||||
"வீட்டுத்",
|
||||
"திட்டங்களுக்கு",
|
||||
"இன்று",
|
||||
"அடிக்கல்",
|
||||
"நாட்டல்",
|
||||
],
|
||||
),
|
||||
(
|
||||
"ஐ போன்க்கு முகத்தை வைத்து அன்லாக் செய்யும் முறை மற்றும் விரலால் தொட்டு அன்லாக் செய்யும் முறையை வாட்ஸ் ஆப் நிறுவனம் இதற்கு முன் கண்டுபிடித்தது",
|
||||
[
|
||||
"ஐ",
|
||||
"போன்க்கு",
|
||||
"முகத்தை",
|
||||
"வைத்து",
|
||||
"அன்லாக்",
|
||||
"செய்யும்",
|
||||
"முறை",
|
||||
"மற்றும்",
|
||||
"விரலால்",
|
||||
"தொட்டு",
|
||||
"அன்லாக்",
|
||||
"செய்யும்",
|
||||
"முறையை",
|
||||
"வாட்ஸ்",
|
||||
"ஆப்",
|
||||
"நிறுவனம்",
|
||||
"இதற்கு",
|
||||
"முன்",
|
||||
"கண்டுபிடித்தது",
|
||||
],
|
||||
),
|
||||
(
|
||||
"இது ஒரு வாக்கியம்.",
|
||||
[
|
||||
"இது",
|
||||
"ஒரு",
|
||||
"வாக்கியம்",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"தன்னாட்சி கார்கள் காப்பீட்டு பொறுப்பை உற்பத்தியாளரிடம் மாற்றுகின்றன",
|
||||
[
|
||||
"தன்னாட்சி",
|
||||
"கார்கள்",
|
||||
"காப்பீட்டு",
|
||||
"பொறுப்பை",
|
||||
"உற்பத்தியாளரிடம்",
|
||||
"மாற்றுகின்றன",
|
||||
],
|
||||
),
|
||||
(
|
||||
"நடைபாதை விநியோக ரோபோக்களை தடை செய்வதை சான் பிரான்சிஸ்கோ கருதுகிறது",
|
||||
[
|
||||
"நடைபாதை",
|
||||
"விநியோக",
|
||||
"ரோபோக்களை",
|
||||
"தடை",
|
||||
"செய்வதை",
|
||||
"சான்",
|
||||
"பிரான்சிஸ்கோ",
|
||||
"கருதுகிறது",
|
||||
],
|
||||
),
|
||||
(
|
||||
"லண்டன் ஐக்கிய இராச்சியத்தில் ஒரு பெரிய நகரம்.",
|
||||
[
|
||||
"லண்டன்",
|
||||
"ஐக்கிய",
|
||||
"இராச்சியத்தில்",
|
||||
"ஒரு",
|
||||
"பெரிய",
|
||||
"நகரம்",
|
||||
".",
|
||||
],
|
||||
),
|
||||
(
|
||||
"என்ன வேலை செய்கிறீர்கள்?",
|
||||
[
|
||||
"என்ன",
|
||||
"வேலை",
|
||||
"செய்கிறீர்கள்",
|
||||
"?",
|
||||
],
|
||||
),
|
||||
(
|
||||
"எந்த கல்லூரியில் படிக்கிறாய்?",
|
||||
[
|
||||
"எந்த",
|
||||
"கல்லூரியில்",
|
||||
"படிக்கிறாய்",
|
||||
"?",
|
||||
],
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
@pytest.mark.parametrize("text,expected_tokens", TA_BASIC_TOKENIZATION_TESTS)
|
||||
def test_ta_tokenizer_basic(ta_tokenizer, text, expected_tokens):
|
||||
tokens = ta_tokenizer(text)
|
||||
token_list = [token.text for token in tokens]
|
||||
assert expected_tokens == token_list
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"text,expected_tokens",
|
||||
[
|
||||
(
|
||||
"ஆப்பிள் நிறுவனம் யு.கே. தொடக்க நிறுவனத்தை ஒரு லட்சம் கோடிக்கு வாங்கப் பார்க்கிறது",
|
||||
[
|
||||
"ஆப்பிள்",
|
||||
"நிறுவனம்",
|
||||
"யு.கே.",
|
||||
"தொடக்க",
|
||||
"நிறுவனத்தை",
|
||||
"ஒரு",
|
||||
"லட்சம்",
|
||||
"கோடிக்கு",
|
||||
"வாங்கப்",
|
||||
"பார்க்கிறது",
|
||||
],
|
||||
)
|
||||
],
|
||||
)
|
||||
def test_ta_tokenizer_special_case(text, expected_tokens):
|
||||
# Add a special rule to tokenize the initialism "யு.கே." (U.K., as
|
||||
# in the country) as a single token.
|
||||
nlp = Tamil()
|
||||
nlp.tokenizer.add_special_case("யு.கே.", [{ORTH: "யு.கே."}])
|
||||
tokens = nlp(text)
|
||||
|
||||
token_list = [token.text for token in tokens]
|
||||
assert expected_tokens == token_list
|
|
@ -41,7 +41,7 @@ def test_tr_lex_attrs_like_number_cardinal_ordinal(word):
|
|||
assert like_num(word)
|
||||
|
||||
|
||||
-@pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci"])
+@pytest.mark.parametrize("word", ["beş", "yedi", "yedinci", "birinci", "milyonuncu"])
|
||||
def test_tr_lex_attrs_capitals(word):
|
||||
assert like_num(word)
|
||||
assert like_num(word.upper())
|
||||
|
|
|
@ -694,5 +694,4 @@ TESTS = ABBREV_TESTS + URL_TESTS + NUMBER_TESTS + PUNCT_TESTS + GENERAL_TESTS
|
|||
def test_tr_tokenizer_handles_allcases(tr_tokenizer, text, expected_tokens):
|
||||
tokens = tr_tokenizer(text)
|
||||
token_list = [token.text for token in tokens if not token.is_space]
|
||||
-print(token_list)
|
||||
assert expected_tokens == token_list
|
||||
|
|
|
@ -12,6 +12,7 @@ def test_build_dependencies():
|
|||
"flake8",
|
||||
"hypothesis",
|
||||
"pre-commit",
|
||||
"black",
|
||||
"mypy",
|
||||
"types-dataclasses",
|
||||
"types-mock",
|
||||
|
|
|
@ -93,8 +93,8 @@ def test_parser_pseudoprojectivity(en_vocab):
|
|||
assert nonproj.is_decorated("X") is False
|
||||
nonproj._lift(0, tree)
|
||||
assert tree == [2, 2, 2]
|
||||
-assert nonproj._get_smallest_nonproj_arc(nonproj_tree) == 7
-assert nonproj._get_smallest_nonproj_arc(nonproj_tree2) == 10
+assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree) == 7
+assert nonproj.get_smallest_nonproj_arc_slow(nonproj_tree2) == 10
|
||||
# fmt: off
|
||||
proj_heads, deco_labels = nonproj.projectivize(nonproj_tree, labels)
|
||||
assert proj_heads == [1, 2, 2, 4, 5, 2, 7, 5, 2]
|
||||
|
|
280
spacy/tests/pipeline/test_edit_tree_lemmatizer.py
Normal file
|
@ -0,0 +1,280 @@
|
|||
import pickle
|
||||
import pytest
|
||||
from hypothesis import given
|
||||
import hypothesis.strategies as st
|
||||
from spacy import util
|
||||
from spacy.lang.en import English
|
||||
from spacy.language import Language
|
||||
from spacy.pipeline._edit_tree_internals.edit_trees import EditTrees
|
||||
from spacy.training import Example
|
||||
from spacy.strings import StringStore
|
||||
from spacy.util import make_tempdir
|
||||
|
||||
|
||||
TRAIN_DATA = [
|
||||
("She likes green eggs", {"lemmas": ["she", "like", "green", "egg"]}),
|
||||
("Eat blue ham", {"lemmas": ["eat", "blue", "ham"]}),
|
||||
]
|
||||
|
||||
PARTIAL_DATA = [
|
||||
# partial annotation
|
||||
("She likes green eggs", {"lemmas": ["", "like", "green", ""]}),
|
||||
# misaligned partial annotation
|
||||
(
|
||||
"He hates green eggs",
|
||||
{
|
||||
"words": ["He", "hat", "es", "green", "eggs"],
|
||||
"lemmas": ["", "hat", "e", "green", ""],
|
||||
},
|
||||
),
|
||||
]
|
||||
|
||||
|
||||
def test_initialize_examples():
|
||||
nlp = Language()
|
||||
lemmatizer = nlp.add_pipe("trainable_lemmatizer")
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
# you shouldn't really call this more than once, but for testing it should be fine
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
with pytest.raises(TypeError):
|
||||
nlp.initialize(get_examples=lambda: None)
|
||||
with pytest.raises(TypeError):
|
||||
nlp.initialize(get_examples=lambda: train_examples[0])
|
||||
with pytest.raises(TypeError):
|
||||
nlp.initialize(get_examples=lambda: [])
|
||||
with pytest.raises(TypeError):
|
||||
nlp.initialize(get_examples=train_examples)
|
||||
|
||||
|
||||
def test_initialize_from_labels():
|
||||
nlp = Language()
|
||||
lemmatizer = nlp.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer.min_tree_freq = 1
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
nlp2 = Language()
|
||||
lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer2.initialize(
|
||||
get_examples=lambda: train_examples,
|
||||
labels=lemmatizer.label_data,
|
||||
)
|
||||
assert lemmatizer2.tree2label == {1: 0, 3: 1, 4: 2, 6: 3}
|
||||
|
||||
|
||||
def test_no_data():
|
||||
# Test that the lemmatizer provides a nice error when there's no tagging data / labels
|
||||
TEXTCAT_DATA = [
|
||||
("I'm so happy.", {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}),
|
||||
("I'm so angry", {"cats": {"POSITIVE": 0.0, "NEGATIVE": 1.0}}),
|
||||
]
|
||||
nlp = English()
|
||||
nlp.add_pipe("trainable_lemmatizer")
|
||||
nlp.add_pipe("textcat")
|
||||
|
||||
train_examples = []
|
||||
for t in TEXTCAT_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
with pytest.raises(ValueError):
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
|
||||
def test_incomplete_data():
|
||||
# Test that the lemmatizer works with incomplete information
|
||||
nlp = English()
|
||||
lemmatizer = nlp.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer.min_tree_freq = 1
|
||||
train_examples = []
|
||||
for t in PARTIAL_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["trainable_lemmatizer"] < 0.00001
|
||||
|
||||
# test the trained model
|
||||
test_text = "She likes blue eggs"
|
||||
doc = nlp(test_text)
|
||||
assert doc[1].lemma_ == "like"
|
||||
assert doc[2].lemma_ == "blue"
|
||||
|
||||
|
||||
def test_overfitting_IO():
|
||||
nlp = English()
|
||||
lemmatizer = nlp.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer.min_tree_freq = 1
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
assert losses["trainable_lemmatizer"] < 0.00001
|
||||
|
||||
test_text = "She likes blue eggs"
|
||||
doc = nlp(test_text)
|
||||
assert doc[0].lemma_ == "she"
|
||||
assert doc[1].lemma_ == "like"
|
||||
assert doc[2].lemma_ == "blue"
|
||||
assert doc[3].lemma_ == "egg"
|
||||
|
||||
# Check model after a {to,from}_disk roundtrip
|
||||
with util.make_tempdir() as tmp_dir:
|
||||
nlp.to_disk(tmp_dir)
|
||||
nlp2 = util.load_model_from_path(tmp_dir)
|
||||
doc2 = nlp2(test_text)
|
||||
assert doc2[0].lemma_ == "she"
|
||||
assert doc2[1].lemma_ == "like"
|
||||
assert doc2[2].lemma_ == "blue"
|
||||
assert doc2[3].lemma_ == "egg"
|
||||
|
||||
# Check model after a {to,from}_bytes roundtrip
|
||||
nlp_bytes = nlp.to_bytes()
|
||||
nlp3 = English()
|
||||
nlp3.add_pipe("trainable_lemmatizer")
|
||||
nlp3.from_bytes(nlp_bytes)
|
||||
doc3 = nlp3(test_text)
|
||||
assert doc3[0].lemma_ == "she"
|
||||
assert doc3[1].lemma_ == "like"
|
||||
assert doc3[2].lemma_ == "blue"
|
||||
assert doc3[3].lemma_ == "egg"
|
||||
|
||||
# Check model after a pickle roundtrip.
|
||||
nlp_bytes = pickle.dumps(nlp)
|
||||
nlp4 = pickle.loads(nlp_bytes)
|
||||
doc4 = nlp4(test_text)
|
||||
assert doc4[0].lemma_ == "she"
|
||||
assert doc4[1].lemma_ == "like"
|
||||
assert doc4[2].lemma_ == "blue"
|
||||
assert doc4[3].lemma_ == "egg"
|
||||
|
||||
|
||||
def test_lemmatizer_requires_labels():
|
||||
nlp = English()
|
||||
nlp.add_pipe("trainable_lemmatizer")
|
||||
with pytest.raises(ValueError):
|
||||
nlp.initialize()
|
||||
|
||||
|
||||
def test_lemmatizer_label_data():
|
||||
nlp = English()
|
||||
lemmatizer = nlp.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer.min_tree_freq = 1
|
||||
train_examples = []
|
||||
for t in TRAIN_DATA:
|
||||
train_examples.append(Example.from_dict(nlp.make_doc(t[0]), t[1]))
|
||||
|
||||
nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
nlp2 = English()
|
||||
lemmatizer2 = nlp2.add_pipe("trainable_lemmatizer")
|
||||
lemmatizer2.initialize(
|
||||
get_examples=lambda: train_examples, labels=lemmatizer.label_data
|
||||
)
|
||||
|
||||
# Verify that the labels and trees are the same.
|
||||
assert lemmatizer.labels == lemmatizer2.labels
|
||||
assert lemmatizer.trees.to_bytes() == lemmatizer2.trees.to_bytes()
|
||||
|
||||
|
||||
def test_dutch():
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
tree = trees.add("deelt", "delen")
|
||||
assert trees.tree_to_str(tree) == "(m 0 3 () (m 0 2 (s '' 'l') (s 'lt' 'n')))"
|
||||
|
||||
tree = trees.add("gedeeld", "delen")
|
||||
assert (
|
||||
trees.tree_to_str(tree) == "(m 2 3 (s 'ge' '') (m 0 2 (s '' 'l') (s 'ld' 'n')))"
|
||||
)
|
||||
|
||||
|
||||
def test_from_to_bytes():
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
trees.add("deelt", "delen")
|
||||
trees.add("gedeeld", "delen")
|
||||
|
||||
b = trees.to_bytes()
|
||||
|
||||
trees2 = EditTrees(strings)
|
||||
trees2.from_bytes(b)
|
||||
|
||||
# Verify that the nodes did not change.
|
||||
assert len(trees) == len(trees2)
|
||||
for i in range(len(trees)):
|
||||
assert trees.tree_to_str(i) == trees2.tree_to_str(i)
|
||||
|
||||
# Reinserting the same trees should not add new nodes.
|
||||
trees2.add("deelt", "delen")
|
||||
trees2.add("gedeeld", "delen")
|
||||
assert len(trees) == len(trees2)
|
||||
|
||||
|
||||
def test_from_to_disk():
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
trees.add("deelt", "delen")
|
||||
trees.add("gedeeld", "delen")
|
||||
|
||||
trees2 = EditTrees(strings)
|
||||
with make_tempdir() as temp_dir:
|
||||
trees_file = temp_dir / "edit_trees.bin"
|
||||
trees.to_disk(trees_file)
|
||||
trees2 = trees2.from_disk(trees_file)
|
||||
|
||||
# Verify that the nodes did not change.
|
||||
assert len(trees) == len(trees2)
|
||||
for i in range(len(trees)):
|
||||
assert trees.tree_to_str(i) == trees2.tree_to_str(i)
|
||||
|
||||
# Reinserting the same trees should not add new nodes.
|
||||
trees2.add("deelt", "delen")
|
||||
trees2.add("gedeeld", "delen")
|
||||
assert len(trees) == len(trees2)
|
||||
|
||||
|
||||
@given(st.text(), st.text())
|
||||
def test_roundtrip(form, lemma):
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
tree = trees.add(form, lemma)
|
||||
assert trees.apply(tree, form) == lemma
|
||||
|
||||
|
||||
@given(st.text(alphabet="ab"), st.text(alphabet="ab"))
|
||||
def test_roundtrip_small_alphabet(form, lemma):
|
||||
# Test with small alphabets to have more overlap.
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
tree = trees.add(form, lemma)
|
||||
assert trees.apply(tree, form) == lemma
|
||||
|
||||
|
||||
def test_unapplicable_trees():
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
tree3 = trees.add("deelt", "delen")
|
||||
|
||||
# Replacement fails.
|
||||
assert trees.apply(tree3, "deeld") == None
|
||||
|
||||
# Suffix + prefix are too large.
|
||||
assert trees.apply(tree3, "de") == None
|
||||
|
||||
|
||||
def test_empty_strings():
|
||||
strings = StringStore()
|
||||
trees = EditTrees(strings)
|
||||
no_change = trees.add("xyz", "xyz")
|
||||
empty = trees.add("", "")
|
||||
assert no_change == empty
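Note: the tests above treat an edit tree as a recipe that rewrites a form into its lemma by keeping a shared substring and recursing on what is left on either side. The following is a minimal pure-Python sketch of that idea, assuming a longest-common-substring split. The names `Subst`, `Match`, `build_tree`, `apply_tree` and `tree_to_str` are illustrative only; this is not spaCy's `EditTrees` implementation, and its tie-breaking and handling of identical strings may differ.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Subst:
    """Leaf node: replace the whole span `orig` with `repl`."""
    orig: str
    repl: str


@dataclass(frozen=True)
class Match:
    """Interior node: keep a shared middle part, recurse on prefix and suffix."""
    prefix_len: int
    suffix_len: int
    prefix: Optional["Node"]
    suffix: Optional["Node"]


Node = Union[Subst, Match]


def longest_common_substring(a: str, b: str) -> Tuple[int, int, int]:
    """Return (start_a, start_b, length) of the first longest common substring."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best


def build_tree(form: str, lemma: str) -> Optional[Node]:
    """Build an edit tree that rewrites `form` into `lemma`."""
    if form == "" and lemma == "":
        return None  # nothing to rewrite on this side
    start_f, start_l, length = longest_common_substring(form, lemma)
    if length == 0:
        return Subst(form, lemma)
    prefix = build_tree(form[:start_f], lemma[:start_l])
    suffix = build_tree(form[start_f + length:], lemma[start_l + length:])
    return Match(start_f, len(form) - start_f - length, prefix, suffix)


def apply_tree(tree: Optional[Node], form: str) -> Optional[str]:
    """Apply `tree` to `form`; return None if the tree does not fit."""
    if tree is None:
        return form
    if isinstance(tree, Subst):
        return tree.repl if form == tree.orig else None
    kept_end = len(form) - tree.suffix_len
    if tree.prefix_len > kept_end:
        return None  # prefix and suffix together are longer than the form
    prefix = apply_tree(tree.prefix, form[:tree.prefix_len])
    suffix = apply_tree(tree.suffix, form[kept_end:])
    if prefix is None or suffix is None:
        return None
    return prefix + form[tree.prefix_len:kept_end] + suffix


def tree_to_str(tree: Optional[Node]) -> str:
    """Render a tree in a bracketed notation like the one used in test_dutch."""
    if tree is None:
        return "()"
    if isinstance(tree, Subst):
        return f"(s {tree.orig!r} {tree.repl!r})"
    return (
        f"(m {tree.prefix_len} {tree.suffix_len} "
        f"{tree_to_str(tree.prefix)} {tree_to_str(tree.suffix)})"
    )
```

Applying a tree built from one pair to a different form either generalizes the same rewrite or returns `None` when a substitution does not line up, which mirrors the behaviour checked in `test_unapplicable_trees`.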
|
|
@ -9,6 +9,9 @@ from spacy.compat import pickle
|
|||
from spacy.kb import Candidate, KnowledgeBase, get_candidates
|
||||
from spacy.lang.en import English
|
||||
from spacy.ml import load_kb
|
||||
from spacy.pipeline import EntityLinker
|
||||
from spacy.pipeline.legacy import EntityLinker_v1
|
||||
from spacy.pipeline.tok2vec import DEFAULT_TOK2VEC_MODEL
|
||||
from spacy.scorer import Scorer
|
||||
from spacy.tests.util import make_tempdir
|
||||
from spacy.tokens import Span
|
||||
|
@ -168,6 +171,45 @@ def test_issue7065_b():
|
|||
assert doc
|
||||
|
||||
|
||||
def test_no_entities():
|
||||
# Test that having no entities doesn't crash the model
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"The sky is blue.",
|
||||
{
|
||||
"sent_starts": [1, 0, 0, 0, 0],
|
||||
},
|
||||
)
|
||||
]
|
||||
nlp = English()
|
||||
vector_length = 3
|
||||
train_examples = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
train_examples.append(Example.from_dict(doc, annotation))
|
||||
|
||||
def create_kb(vocab):
|
||||
# create artificial KB
|
||||
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
|
||||
mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
|
||||
mykb.add_alias("Russ Cochran", ["Q2146908"], [0.9])
|
||||
return mykb
|
||||
|
||||
# Create and train the Entity Linker
|
||||
entity_linker = nlp.add_pipe("entity_linker", last=True)
|
||||
entity_linker.set_kb(create_kb)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
for i in range(2):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
||||
# adding additional components that are required for the entity_linker
|
||||
nlp.add_pipe("sentencizer", first=True)
|
||||
|
||||
# this will run the pipeline on the examples and shouldn't crash
|
||||
results = nlp.evaluate(train_examples)
|
||||
|
||||
|
||||
def test_partial_links():
|
||||
# Test that having some entities on the doc without gold links, doesn't crash
|
||||
TRAIN_DATA = [
|
||||
|
@ -650,7 +692,7 @@ TRAIN_DATA = [
|
|||
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}),
|
||||
("Russ Cochran his reprints include EC Comics.",
|
||||
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
||||
-"entities": [(0, 12, "PERSON")],
+"entities": [(0, 12, "PERSON"), (34, 43, "ART")],
|
||||
"sent_starts": [1, -1, 0, 0, 0, 0, 0, 0]}),
|
||||
("Russ Cochran has been publishing comic art.",
|
||||
{"links": {(0, 12): {"Q7381115": 1.0, "Q2146908": 0.0}},
|
||||
|
@ -693,6 +735,7 @@ def test_overfitting_IO():
|
|||
|
||||
# Create the Entity Linker component and add it to the pipeline
|
||||
entity_linker = nlp.add_pipe("entity_linker", last=True)
|
||||
assert isinstance(entity_linker, EntityLinker)
|
||||
entity_linker.set_kb(create_kb)
|
||||
assert "Q2146908" in entity_linker.vocab.strings
|
||||
assert "Q2146908" in entity_linker.kb.vocab.strings
|
||||
|
@ -922,3 +965,113 @@ def test_scorer_links():
|
|||
|
||||
assert scores["nel_micro_p"] == 2 / 3
|
||||
assert scores["nel_micro_r"] == 2 / 4
|
||||
|
||||
|
||||
# fmt: off
|
||||
@pytest.mark.parametrize(
|
||||
"name,config",
|
||||
[
|
||||
("entity_linker", {"@architectures": "spacy.EntityLinker.v1", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
|
||||
("entity_linker", {"@architectures": "spacy.EntityLinker.v2", "tok2vec": DEFAULT_TOK2VEC_MODEL}),
|
||||
],
|
||||
)
|
||||
# fmt: on
|
||||
def test_legacy_architectures(name, config):
|
||||
# Ensure that the legacy architectures still work
|
||||
vector_length = 3
|
||||
nlp = English()
|
||||
|
||||
train_examples = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
doc = nlp.make_doc(text)
|
||||
train_examples.append(Example.from_dict(doc, annotation))
|
||||
|
||||
def create_kb(vocab):
|
||||
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
|
||||
mykb.add_entity(entity="Q2146908", freq=12, entity_vector=[6, -4, 3])
|
||||
mykb.add_entity(entity="Q7381115", freq=12, entity_vector=[9, 1, -7])
|
||||
mykb.add_alias(
|
||||
alias="Russ Cochran",
|
||||
entities=["Q2146908", "Q7381115"],
|
||||
probabilities=[0.5, 0.5],
|
||||
)
|
||||
return mykb
|
||||
|
||||
entity_linker = nlp.add_pipe(name, config={"model": config})
|
||||
if config["@architectures"] == "spacy.EntityLinker.v1":
|
||||
assert isinstance(entity_linker, EntityLinker_v1)
|
||||
else:
|
||||
assert isinstance(entity_linker, EntityLinker)
|
||||
entity_linker.set_kb(create_kb)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
|
||||
for i in range(2):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
||||
|
||||
@pytest.mark.parametrize(
|
||||
"patterns",
|
||||
[
|
||||
# perfect case
|
||||
[{"label": "CHARACTER", "pattern": "Kirby"}],
|
||||
# typo for false negative
|
||||
[{"label": "PERSON", "pattern": "Korby"}],
|
||||
# random stuff for false positive
|
||||
[{"label": "IS", "pattern": "is"}, {"label": "COLOR", "pattern": "pink"}],
|
||||
],
|
||||
)
|
||||
def test_no_gold_ents(patterns):
|
||||
# test that annotating components work
|
||||
TRAIN_DATA = [
|
||||
(
|
||||
"Kirby is pink",
|
||||
{
|
||||
"links": {(0, 5): {"Q613241": 1.0}},
|
||||
"entities": [(0, 5, "CHARACTER")],
|
||||
"sent_starts": [1, 0, 0],
|
||||
},
|
||||
)
|
||||
]
|
||||
nlp = English()
|
||||
vector_length = 3
|
||||
train_examples = []
|
||||
for text, annotation in TRAIN_DATA:
|
||||
doc = nlp(text)
|
||||
train_examples.append(Example.from_dict(doc, annotation))
|
||||
|
||||
# Create a ruler to mark entities
|
||||
ruler = nlp.add_pipe("entity_ruler")
|
||||
ruler.add_patterns(patterns)
|
||||
|
||||
# Apply ruler to examples. In a real pipeline this would be an annotating component.
|
||||
for eg in train_examples:
|
||||
eg.predicted = ruler(eg.predicted)
|
||||
|
||||
def create_kb(vocab):
|
||||
# create artificial KB
|
||||
mykb = KnowledgeBase(vocab, entity_vector_length=vector_length)
|
||||
mykb.add_entity(entity="Q613241", freq=12, entity_vector=[6, -4, 3])
|
||||
mykb.add_alias("Kirby", ["Q613241"], [0.9])
|
||||
# Placeholder
|
||||
mykb.add_entity(entity="pink", freq=12, entity_vector=[7, 2, -5])
|
||||
mykb.add_alias("pink", ["pink"], [0.9])
|
||||
return mykb
|
||||
|
||||
# Create and train the Entity Linker
|
||||
entity_linker = nlp.add_pipe(
|
||||
"entity_linker", config={"use_gold_ents": False}, last=True
|
||||
)
|
||||
entity_linker.set_kb(create_kb)
|
||||
assert entity_linker.use_gold_ents == False
|
||||
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
for i in range(2):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
||||
# adding additional components that are required for the entity_linker
|
||||
nlp.add_pipe("sentencizer", first=True)
|
||||
|
||||
# this will run the pipeline on the examples and shouldn't crash
|
||||
results = nlp.evaluate(train_examples)
|
||||
|
|
|
@ -184,7 +184,7 @@ def test_overfitting_IO():
|
|||
token.pos_ = ""
|
||||
token.set_morph(None)
|
||||
optimizer = nlp.initialize(get_examples=lambda: train_examples)
|
||||
-print(nlp.get_pipe("morphologizer").labels)
+assert nlp.get_pipe("morphologizer").labels is not None
|
||||
for i in range(50):
|
||||
losses = {}
|
||||
nlp.update(train_examples, sgd=optimizer, losses=losses)
|
||||
|
|
Some files were not shown because too many files have changed in this diff.